Methods for intelligent load balancing and high speed intelligent network recorders

ABSTRACT

A high speed intelligent network recorder for recording a plurality of flows of network data packets into and out of a computer network over a relevant data time window is disclosed. The high speed intelligent network recorder includes a printed circuit board; a high speed network switching device mounted to the printed circuit board; and an X column by Y row array of a plurality of intelligent hard drives with micro-computers mounted to the printed circuit board and coupled in parallel with the high speed network switching device.

CROSS REFERENCE

This patent application is a continuation application claiming the benefit of U.S. patent application Ser. No. 15/688,847 entitled METHODS FOR INTELLIGENT LOAD BALANCING AND HIGH SPEED INTELLIGENT NETWORK RECORDERS filed on Aug. 28, 2017 by inventors Anthony Coddington et al. U.S. patent application Ser. No. 15/688,847 is a continuation application claiming the benefit of U.S. patent application Ser. No. 15/145,787 entitled INTELLIGENT LOAD BALANCING AND HIGH SPEED INTELLIGENT NETWORK RECORDERS filed on May 3, 2015 by inventors Anthony Coddington et al. U.S. patent application Ser. No. 15/145,787 claims the benefit of U.S. Provisional Patent Application No. 62/156,885 entitled METHODS, APPARRATUS, AND SYSTEMS FOR DISTRIBUTED HIGH SPEED INTELLIGENT NETWORK RECORDER filed on May 4, 2014 by inventors Anthony Coddington et al.

This patent application is related to U.S. patent application Ser. No. 14/459,748, entitled HASH TAG LOAD BALANCING filed on Aug. 14, 2014 by inventors Karsten Benz et al., with its hash tagging methods and apparatus incorporated herein by reference. U.S. patent application Ser. No. 14/459,748 claims priority to U.S. Patent Application No. 61/973,828 filed on Apr. 1, 2014 by inventors Karsten Benz et al.

FIELD

The embodiments generally relate to storing of ingress and egress packet communications with networked devices in a local area network.

BACKGROUND

Effective computer security strategies integrate network security monitoring. Network security monitoring involves the collection and analysis of data to help a network administrator detect and respond to intrusions. Accordingly, network security and maintenance are not simply about building impenetrable firewalls. Determined attackers may eventually overcome traditional defenses of a computer network.

The ability to capture and analyze network behavior for incident detection of a computer network attack is becoming increasingly challenging. Incident detection is particularly challenging for network and security administrators in which the computer network is capable of transmitting Ethernet frames or packets at a rate of ten gigabits per second (10 GbE) or higher. Incident detection is also challenging where a network includes a virtual, hybrid, or cloud architecture.

After an incident of a computer network attack has been detected, it is desirable to analyze how the attack occurred and what data may have been compromised or copied from a computer network. There may be some delay in determining when an incident is detected. Accordingly, storage of the data packet communication into and out of a computer network can be useful in making a determination of what data was compromised, how the data was compromised, and who performed the attack.

Accordingly, it is desirable to store data packet communication with a computer network to assist in resolving a computer network attack.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments will be described with reference to the Figures, in which like reference numerals denote like elements and in which:

FIG. 1A illustrates a block diagram of a data center configured for centralized packet capture of ingress and egress Ethernet packets and centralized netflow record collection.

FIG. 1B illustrates a block diagram of a data center configured for distributed packet capture of ingress and egress Ethernet packets and centralized netflow record collection.

FIG. 1C illustrates a diagram of an exemplary network data packet that is recorded as part of a data flow between network addresses.

FIG. 2A illustrates a block diagram of a plurality of intelligent storage nodes coupled in communication with a high speed switch to form the basis of a high speed intelligent network recorder.

FIG. 2B illustrates a detailed block diagram of a high speed intelligent network recorder (HSINR) for recording data packets of network flows over a relevant data time window.

FIG. 2C illustrates a block diagram of a high speed intelligent network recorder for the management and distributed storage of IP packets in network flows.

FIG. 3A illustrates a block diagram of an instance of an intelligent hard drive with a magnetic disk that may be instantiated into the array of intelligent hard drives of the network recorder illustrated in FIG. 2B.

FIG. 3B illustrates a block diagram of an instance of an intelligent hard drive with solid state memory that may be instantiated into the array of intelligent hard drives of the network recorder illustrated in FIG. 2B.

FIG. 3C illustrates a block diagram of an instance of an intelligent hard drive including a plurality of hard drives coupled to one micro-computer that may be instantiated into the array of intelligent hard drives of the network recorder illustrated in FIG. 2B.

FIGS. 4A-4D illustrate diagrams of relevant data windows along data flow time lines.

FIG. 5 illustrates a block diagram of an instance of a switch illustrated in FIG. 2B.

FIG. 6A is a perspective view from the bottom of the subassemblies within the bottom bay of a high speed intelligent network recorder.

FIG. 6B is a perspective view from the top of the subassemblies in the top bay of a high speed intelligent network recorder.

FIG. 6C is a perspective view from the side of the high speed intelligent network recorder with the case ghosted out for better view of the subassemblies.

FIG. 6D is a side view of a storage unit and a control unit together forming an alternate embodiment of a high speed intelligent network recorder to allow each unit to be located in different server storage racks.

FIG. 7A illustrates a block diagram of an instance of a controller card being plugged into sockets of a backplane printed circuit board.

FIGS. 7B-1 and 7B-2 (collectively FIG. 7B) illustrate a functional block diagram of a portion of the controller card shown in FIG. 7A that may be plugged into a high speed intelligent network recorder.

FIG. 7C illustrates a block diagram of an instance of a storage drive being plugged into a drive tray that is in turn plugged into sockets of the backplane printed circuit board.

FIG. 8A illustrates a block diagram of a high speed intelligent network recording system within a portion of a local area network.

FIG. 8B illustrates a block diagram of an intelligent load balancing card.

FIG. 9 illustrates a block diagram of an intelligent load balancer providing intelligent load balancing to a plurality of nodes of intelligent network storage.

FIG. 10 illustrates a block diagram of data packet flow and processing with intelligent load balancing by an intelligent load balancer in the intelligent network recording system.

FIG. 11 illustrates a functional block diagram of intelligent load balancing by the intelligent load balancer.

FIGS. 12A-12D illustrate bandwidth charts for various load balancing conditions that may occur in the intelligent network recording system.

FIG. 13 illustrates a diagram of the process of cold node assignment by a cold node assignment lookup table by the intelligent load balancer.

FIGS. 14A-14B illustrate the process of hot balancing with a hot balancing weighting algorithm by the intelligent load balancer.

FIGS. 15A-15D illustrate the process of cold bin movement rebalancing by the intelligent load balancer.

FIG. 16 illustrates the process of cold bin movement rebalancing by the intelligent load balancer.

FIG. 17 illustrates the process of node availability change by the intelligent load balancer.

FIGS. 18A-18B illustrate the process of the node status and discovery subsystem performed in the intelligent load balancer.

FIG. 19 illustrates the process of the count-min sketch algorithm performed in the intelligent load balancer.

FIG. 20 illustrates a query process that may be performed on the nodes in the intelligent network recording system.

FIG. 21 illustrates a diagram to calculate a minimum approximate bandwidth threshold.

FIG. 22 shows details of a query return process.

FIG. 23A shows a record capture flow within an intelligent hard drive.

FIG. 23B shows one embodiment of messages transmitted by a broken flow messaging system by way of an example broken flow split across three nodes.

FIG. 24 shows a broken flow reassembly process for back-testing.

DETAILED DESCRIPTION

In the following detailed description of the embodiments, numerous specific details are set forth in order to provide a thorough understanding of the embodiments. However, it will be obvious to one skilled in the art that the embodiments may be practiced without these specific details. In other instances well known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the embodiments. A device described herein is hardware, software, or a combination of hardware and software.

INTRODUCTION

The ability to collect and access packet-level data is important to analyzing the root cause of network issues. However, capturing all of the data packets on high-speed networks can prove to be challenging. To help overcome these issues, high speed intelligent network recorders (HSINRs) are provided that include an array of intelligent hard drives, among other devices. The HSINRs provide a network recording of data packets with minimal latency, regardless of packet size, interface type, or network load. In particular, the HSINRs use hash tags in a way that simplifies a load balancing scheme for recording data packets for data flows within a data center computer network.

The description herein includes a general overview of a data center, the role of HSINRs within a data center, and details of load balancing operations of network flows.

Data Center Computer Network Overview

Referring now to FIG. 1A, a block diagram of an exemplary data center computer network 100A is shown. The data center computer network 100A may include, without limitation, a router 168, a firewall 166, a tap 400, an intelligent load balancer (ILB) 801, a high speed intelligent network recorder (HSINR) 170, netflow generators 180A-180D, netflow collectors (NFCs) 162A-162D, a central NFC 164, a network switch 110A, one or more servers 112A-112B, one or more tiered storage appliances 114A-114B, one or more storage array appliances 116A-116B, and one or more flash appliances 118 coupled together by one or more high speed networking cables (e.g., Ethernet wired cables 111A-111B, Fibre Channel optical cables 113A-113G) to form a local computer network 101A, often referred to as a local area network (LAN) 101A.

To store ingress and egress network internet protocol (IP) packets (e.g., Ethernet packets) between the computer network 101A and the internet cloud (wide area network) 102, the local computer network 101A includes a high speed intelligent network recorder (HSINR) 170. To balance the load of storing packets into a plurality of storage devices in the HSINR 170, the network 101A includes the intelligent load balancer (ILB) 801. To analyze the stored packets, the HSINR 170 may couple to an analyzer 156L that provides a query agent. Alternatively, a query agent may be included as part of the ILB 801.

Each NGA 180A-180D is coupled to the tap 400 to receive ingress and egress IP packets. Each NGA 180A-180D may be coupled to the switch 110A. Each NGA 180A-180D analyzes the ingress and egress IP packets it receives and generates netflow records that summarize a computer communication between IP addresses. The netflow records may be routed to a plurality of NFCs 162A-162D. Each NFC 162A-162D is coupled to the network switch 110A and the central NFC 164 that can merge netflow records together.

A pair of computer servers 112A-112B are connected to the network switch 110A via Ethernet cables 111A-111B terminating in Ethernet cards 120A-120B installed on the servers 112A-112B to communicate using an Ethernet communication protocol. The computer servers 112A-112B may further have Fibre Channel host bus adapter cards 122A-122B respectively installed into them to communicate using a Fibre Channel communication protocol.

In one embodiment, a target network device (also referred to herein as a storage target) includes Fibre Channel cards 124A-124C installed to receive signals, including a storage request, from the servers 112A-112B off of wires or cables, such as Fibre Channel cables 113C-113D. The target network device may be one of the tiered storage arrays 114A-114B, the storage arrays 116A-116B, or the flash appliance 118 (referred to collectively as storage array appliances). Fibre Channel cards 124A, 124B, 124E, 124F, and 124G may be installed in the storage array appliances 114A, 114B, 116A-116B and 118.

The servers 112A-112B have Fibre Channel host bus adapters 122A-122B that are coupled to the Fibre Channel cards 124A-124C, 124D-124G in the storage array appliances 114A-114B, 116A-116B and 118. The Fibre Channel host adapters 122A-122B may differ somewhat from the Fibre Channel cards 124A-124B, 124E-124G because the server 112A,112B is an initiator and the storage array appliances 114A-114B, 116A-116B, 118 are targets.

In some embodiments, the connections between servers 112A-112B and the storage array appliances 114A, 114B, 116A, and 116B are via fiber cables 113A, 113B, 113E, 113F, and 113G that terminate at one end at the Fibre Channel cards 118, 124A, 124B, 124C, 124E, 124F, and 124G of the storage array appliances 114A, 114B, 116A and 116B.

One or more clients 150A-150N in a client-server network 100A may interface with the local computer network (data center) 101A over a wide area network (WAN) 102, such as the Internet or World Wide Web. The one or more clients 150A-150N may desire one or more server functions of the servers 112A-112B for software applications and/or storage capacity provided by the storage arrays or appliances 114A-114B, 116A-116B, 118 to store data. Servers/storage arrays in the data center 101A can communicate with the one or more remotely located clients 150A-150N over the WAN 102.

One or more malicious clients 152A-152N may pose a security threat to the data center computer network 100A. Accordingly, a user (e.g., network administrator) can manage security of the data center computer network 100A via tools, such as with a local analyzer 156L or a remote analyzer 156R. A local analyzer 156L may be coupled to the HSINR 170 or to the one or more NFCs 162A-162D, 164. A management console 158, including a monitor and a keyboard, may be coupled to the local analyzer 156L from which the computer network can be managed by a user. Alternatively, the user can manage security of the data center computer network 100A remotely over the Internet cloud 102. For example, the user can manage security of the data center computer network 100A via tools, such as a remote analyzer tool 156R and a remote management console 154, including a monitor and keyboard. The remote analyzer 156R and the remote management console 154 are in communication with the one or more NFCs 162A-162D, 164 and/or the HSINR 170.

FIG. 1B is a block diagram of another example data center computer network 100B. The data center computer network 100B is similar to the data center computer network 100A of FIG. 1A. However, the data center computer network 100B of FIG. 1B includes a switch 110B that is located between the firewall 166 and two taps 400 and 400′. The network 100B includes a pair of intelligent load balancers ILB1 801 and ILB2 801′, as well as a pair of high speed intelligent network recorders HSINR1 170 and HSINR2 170′ respectively coupled to the pair of ILBs 801,801′. One or more local analyzers 156L with a query agent 2000 may be coupled to the pair of high speed intelligent network recorders HSINR1 170 and HSINR2 170′ to analyze the network traffic. Alternatively, a query agent 2000 may be included as part of each of the ILBs 801,801′.

The switch 110B is coupled to the firewall 166, tap 400, tap 400′, NGA 180, and NGA 180′. The first tap TAP1 400 is also coupled to the intelligent load balancer ILB1 801 which is in turn coupled to the high speed intelligent network recorder HSINR1 170. The second tap TAP2 400′ is coupled to the intelligent load balancer ILB2 801′ which is in turn coupled to the high speed intelligent network recorder HSINR2 170′. The NGA 180 is coupled to the tap 400, the switch 110B, and the NFC 162. NGA 180′ is coupled to the tap 400′, the switch 110B, and NFC 162′. NFC 162 and NFC 162′ are coupled to the switch 110B and the central NFC 164.

Other devices of the data center computer network 100B may be similar to the devices of the data center computer network 100A of FIG. 1A.

Network Data Flows and Ethernet Packets

FIG. 1C is a diagram illustrating an example network data packet 1102, such as an Ethernet packet. The Ethernet packet 1102 includes a header field and a data field. The header field of the Ethernet packet 1102 includes a destination or receiver media access control (MAC) address, a source or sender MAC address, and a field for other header information such as ether-type.

The data field of the Ethernet packet 1102 includes an IP packet 1104, which includes a header field and a data field. The header field of the IP packet 1104 includes a version field, a header length field, a type of service (ToS) field, a total length field, a packet identifier, a time to live (TTL) field, a protocol field 1108, a header checksum, a source IP address 1110, and a destination IP address 1112.

To form a record, additional fields may be inserted into the header field or data field of the Ethernet packet 1102; or the header field or data field of the IP packet 1104. For example, a time stamp 1003, a flow hash 1005, and a record length 1004 may be pre-pended to the header of the Ethernet packet 1102 as shown. The Ethernet packet 1102 with this added information may be re-encapsulated to transmit one or more records over a network from one network device to another, for example, into the data field of the IP packet 1104. Further information may be added to the IP packet 1104 during processing of the record, such as a hot/cold flag 1090, and/or other meta data 1091 such as a logical unit number (LUN) or disk identifier of a storage device, for example.
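As an illustrative sketch only (the field widths and byte order shown here are assumptions, not the record layout of the embodiments), the prepended record fields described above can be modeled as a small fixed-size header packed ahead of a captured frame:

    import struct
    import time

    # Hypothetical record header: a 64-bit nanosecond timestamp (1003), a 32-bit
    # flow hash (1005), and a 32-bit record length (1004) prepended to the frame.
    RECORD_HEADER = struct.Struct("!QII")

    def build_record(flow_hash, frame):
        """Prepend a timestamp, flow hash, and record length to a captured frame."""
        header = RECORD_HEADER.pack(time.time_ns(), flow_hash, len(frame))
        return header + frame

    def parse_record(record):
        """Recover the prepended fields and the original frame from a record."""
        timestamp_ns, flow_hash, length = RECORD_HEADER.unpack_from(record)
        frame = record[RECORD_HEADER.size:RECORD_HEADER.size + length]
        return timestamp_ns, flow_hash, frame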

The data field of the IP packet 1104 may include one or more of transmission control protocol (TCP) packets, user datagram protocol (UDP) packets, or stream control transmission protocol (SCTP) packets. FIG. 1C illustrates a transmission control protocol (TCP) packet 1106 including a header field and a data field. The header field of the TCP packet 1106 includes a source port number 1114, a destination port number 1116, a send number, an acknowledgement number, one or more flags, and a checksum. A plurality of TCP packets between the same IP addresses and port numbers may be grouped together to form a network flow.

Network traffic into and out of a data center or local area network is organized into network flows of network packets forming conversations between processes or computers. A network flow is one or more network data packets sent over a period of time for a given communication session between two internet protocol (IP) addresses. A network flow record (netflow record) may be generated to summarily identify the network flow of network data packets between two devices associated with the two internet protocol (IP) addresses.

Devices which analyze these conversations require access primarily to the first group of N packets, perhaps twenty or thirty packets, for example, in a network flow. Some analysis of conversations will find the first N packets sufficient (for example, application detection). However, some analysis of conversations will require all the flow packets (for example, a SNORT analysis). Unfortunately, network flows are not uniform.

Network flows vary widely in size from conversation to conversation. Network flows with a data bandwidth smaller than a certain bandwidth threshold are referred to herein as being a cold flow, cold traffic, or just cold. Network flows with a bandwidth greater than or equal to the bandwidth threshold are referred to herein as being a hot flow, hot traffic, or just hot.
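A minimal sketch of this hot/cold classification, assuming a per-flow byte count sampled over a fixed interval; the threshold value and interval below are illustrative assumptions only:

    # Hypothetical hot/cold classification of a flow by observed bandwidth.
    BANDWIDTH_THRESHOLD_BPS = 50_000_000  # 50 Mb/s, an example threshold
    MEASUREMENT_INTERVAL_S = 1.0          # an example sampling interval

    def classify_flow(bytes_seen_in_interval):
        """Return 'hot' if the flow meets or exceeds the threshold, else 'cold'."""
        bits_per_second = (bytes_seen_in_interval * 8) / MEASUREMENT_INTERVAL_S
        return "hot" if bits_per_second >= BANDWIDTH_THRESHOLD_BPS else "cold"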

A network flow is identified by the end points which are communicating via the network flow. However, the number of specific details and the size of the specific details that identify the endpoints depend on the protocol the endpoints are using to communicate. For example, a web server and client communicating over an IPv4 TCP connection will be characterized by a pair of IPv4 32-bit IP addresses, a pair of 16-bit ports and the ethertype used. However, a similar communication over an IPv6 TCP connection will require 128-bit IPv6 addresses. A non-IP communication may be identified by MAC addresses.

In order to refer to all network flows equally, a hash is formed over the characterizing identifiers, referred to as a flowhash 1005. The flowhash is a pseudo-random number generated in response to the fields (e.g., a source IP address 1110, a destination IP address 1112, a source port number 1114, and a destination port number 1116) in an Ethernet packet 1102, an IP packet 1104, and a TCP packet 1106 that are encapsulated together as one, for example. U.S. patent application Ser. No. 14/459,748 describes a method of generating hash tags for netflow records, for example. The data bit width of a flowhash may be 24, 32 or 56 bits, for example.
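One way such a flowhash might be formed over the characterizing identifiers is sketched below; the hash function, field selection, and default 32-bit width are illustrative assumptions and not the method of U.S. patent application Ser. No. 14/459,748:

    import hashlib
    import ipaddress

    def flow_hash(src_ip, dst_ip, src_port, dst_port, protocol, width_bits=32):
        """Form a pseudo-random flowhash over the identifiers of a flow.

        Accepts IPv4 or IPv6 addresses; the result is truncated to the
        requested bit width (e.g., 24, 32, or 56 bits).
        """
        key = b"".join([
            ipaddress.ip_address(src_ip).packed,
            ipaddress.ip_address(dst_ip).packed,
            src_port.to_bytes(2, "big"),
            dst_port.to_bytes(2, "big"),
            protocol.to_bytes(1, "big"),
        ])
        digest = hashlib.sha1(key).digest()
        return int.from_bytes(digest, "big") % (1 << width_bits)

    # Example: a TCP (protocol 6) conversation between two IPv4 endpoints.
    h = flow_hash("192.0.2.10", "198.51.100.7", 51514, 443, 6)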

The timestamp 1003 added to each packet of the flows can, in a uniform manner, identify the different dates and times the packets are received by a network device, such as at a probe or tap in a data center or a local area network, for example.

High Speed Intelligent Network Recorder Functions

High speed intelligent network recorders (HSINR) are a part of a network monitoring infrastructure. High speed intelligent network recorders can capture and store network traffic at wire speed without packet loss. A high speed intelligent network recorder can store days or weeks of network flows of data between devices depending upon how much storage is available.

High speed intelligent network recorders (HSINR) unobtrusively monitor every packet on network links, simultaneously adding a time/date stamp into each packet and storing a copy of each packet into memory and then into a hard drive. Similar to a database, network operators can query and search through the stored data packets in the high speed intelligent network recorder to quickly isolate issues that might be impacting network performance and security. A network flow of packets can be played back to analyze the traffic in greater detail. The high speed intelligent network recorder is a massively parallel distributed processor and data storage device.

Instead of packets, a data base of data fields may be stored in a high speed intelligent network recorder with an index to accelerate searches and function as a high speed data base server. An array of intelligent hard drives in the high speed intelligent network recorder can be used to perform data base operations on the data (or data packets) stored therein. One such data base operation is a network search, for example.

With every captured packet being time stamped, a high speed intelligent network recorder can accurately replay stored data while maintaining inter-packet delay intervals, guaranteeing recreation of the originally monitored network traffic. Network operators can replay the stored network traffic to see events on a network as they occurred, providing the ability to recreate real network scenarios, identify the cause and effect of alarm conditions, load and test networking and security equipment, and actively study user experiences for services, such as live video on demand, for example.

FIG. 2A illustrates a conceptual diagram of a high speed intelligent network recorder 200 with an intelligent storage array 220 of a plurality of intelligent storage nodes 260AA-260XY coupled in communication with a high speed switch 202. An intelligent storage node 260 is a system of storage resources including a connection device 262, a microcomputer 210 or portion of computer cycles thereof, and at least one hard drive 212 coupled together. The ratio of hard drive storage devices 212 to the microcomputer 210 may be one to one (1:1) or a plurality to one (e.g., N:1). One microcomputer 210 may be shared by a plurality of nodes 260 using a plurality of software processes.

Each of the plurality of storage nodes 260AA-260XY is coupled in parallel to one or more high speed network switching devices 202 by one or more high speed networking cables 232 (e.g., Ethernet or Fibre Channel communication protocols over optical or wire cables). The one or more high speed network switching devices 202 are considered to be a part of the high speed intelligent network recorder 200. The one or more high speed network switching devices 202 may be coupled to a local storage area network by another set of one or more high speed networking cables 282 (e.g., Ethernet or Fibre Channel communication protocols over optical or wire cables).

High Speed Intelligent Network Recorder Architecture

Referring now to FIG. 2B, a functional block diagram is shown of a high speed intelligent network recorder (HSINR) 200 that may be used to implement the high speed intelligent network recorder (HSINR) 170,170′ shown in FIGS. 1A-1B. The HSINR 200 includes a printed circuit board 299 with an X by Y intelligent storage array 220 of a plurality of intelligent hard drives 201AA-201XY coupled in parallel to a high speed network switching device 202. The printed circuit board 299 includes a plurality of wires, or printed circuit board traces 251,252A,252B,254, to propagate network signals, including Ethernet or network data packets, between devices for storage into the array of the plurality of intelligent hard drives 201AA-201XY. Alternatively, wire cables or optical cables 282,232 may be used to propagate network signals, including Ethernet or network data packets, between devices for storage into the array of the plurality of intelligent hard drives.

The high speed network switching device 202 may be mounted to or plugged into the printed circuit board 299 and coupled to the wires or PCB traces 254 of the printed circuit board so that it is in communication with the plurality of intelligent hard drives 201AA-201XY. Alternatively, the high speed network switching device 202 may be a separate device that couples to the plurality of intelligent hard drives 201AA-201XY via wire cables or optical cables 232 as shown in FIG. 2C to form the HSINR.

In FIG. 2B, the high speed network switching device 202 is coupled to a network to receive a plurality of flows of network data packets to and from one or more network devices in the network. The high speed network switching device 202 may couple to the intelligent load balancer 801,801′ (or the tap 400,400′ without an ILB) shown in FIGS. 1A-1B by wire cables or optical cables 282. The high speed network switching device 202 may couple to one or more query agents or analyzers by a plurality of wire cables or optical cables 252A,252B.

Each of the plurality of intelligent hard drives 201 may include a micro-computer 210 and one or more hard drive storage devices 212, such as a magnetic disk drive or a solid state storage drive (SSD), coupled in communication together. The ratio (D to M ratio) of hard drive storage devices 212 to microcomputer devices 210 may be one to one (1:1); a plurality to one (e.g., D:1); or a plurality to two (e.g., D:2).

The high speed intelligent network recorder (HSINR) may have different form factors, such as a two rack unit (2 U) form factor, a three rack unit (3 U) form factor, a four rack unit (4 U) form factor, or a six rack unit (6 U) form factor.

The number of hard drive storage devices 212 in the array 220 ranges from 100 to 2000. In one embodiment, there are 350 hard drive storage devices in the array 220, each having a capacity of about two terabytes such that there is approximately 700 terabytes of storage capacity in the array of the plurality of intelligent hard drives 201AA-201XY. In some embodiments, the hard drive storage devices have a small form factor, such as a 2.5 inch form factor of laptop drives, and a SATA interface plug.

The hard drive storage device 212 is a pluggable drive that plugs into a socket 211 such that the length of the hard drive is perpendicular to the printed circuit board 299. The socket 211 and microcomputer 210 are mounted to the printed circuit board 299. The socket 211 is coupled in communication to the microcomputer 210 through wire traces 251 of the PCB 299.

The hard drive storage device 212 includes one or more drive controllers (see drive controller 313 in FIGS. 3A-3B) that support self-monitoring, analysis, and reporting technology (SMART). U.S. Pat. No. 6,895,500 issued May 17, 2005 to inventor Michael Rothberg, incorporated herein by reference, discloses a magnetic disk drive with SMART and how it may be used. The controller can report SMART attributes about the hard drive storage device such as head flying height, remapped sector quantity, corrected error counts, uncorrectable error counts, spin up time, temperature, and data throughput rate. Other SMART attributes can also be reported.

This information may be used to predict advance failure. For example, a drop in head flying height often occurs before a head crashes onto the disk or platter. Remapped sectors occur due to internally detected errors. A large quantity of remapped sectors can indicate the drive is starting to fail. Correctable error counts, if significant and increasing, can indicate that the drive is failing. A change in spin-up time of a disk or platter, usually an increase, can indicate problems with a spindle motor that spins the disk or platter. Drive temperature increases may also indicate spindle motor failure. A reduction in data throughput can indicate an internal problem with the hard drive. In any case, the controller can provide an indication of advance failure of writing and reading to a hard drive that is useful in an array of intelligent hard drives.
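A minimal sketch of combining reported SMART attributes into an advance-failure decision follows; the attribute names and limits are illustrative assumptions rather than recommended thresholds:

    # Hypothetical advance-failure check from reported SMART attributes.
    LIMITS = {
        "remapped_sectors": 50,       # many remapped sectors suggest a failing drive
        "corrected_errors": 1000,     # significant corrected error counts
        "uncorrectable_errors": 0,    # any uncorrectable error is treated as failing
        "spin_up_time_ms": 8000,      # increasing spin-up time suggests motor trouble
        "temperature_c": 60,          # elevated temperature may indicate motor failure
    }

    def drive_is_failing(smart_attributes):
        """Return True if any reported attribute exceeds its illustrative limit."""
        return any(smart_attributes.get(name, 0) > limit
                   for name, limit in LIMITS.items())

    # A drive flagged this way can be excluded from new writes while any flows
    # already recorded on it remain readable.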

Referring now to FIG. 2C, a block diagram is shown of a plurality of storage processing units 250A-250T that form the intelligent storage array 220 in the HSINR 200. Each storage processing unit 250 of the plurality may include one or more drive controllers 238 that are coupled to the one or more storage hard drives 212. Each storage processing unit 250 may further include one or more microcomputers 236 coupled to the one or more drive controllers 238 through a connector (e.g., Com Express 2.1 connector) plugged into a socket (e.g., a Com Express 2.1 socket) mounted on the backplane printed circuit board. The one or more drive controllers may be one or more SATA drive controllers 238 which are coupled to the one or more SATA storage hard drives 212.

A plurality of storage processing units 250A-250T are coupled in parallel to one or more high speed network switching devices 202 by one or more high speed networking cables 232 (e.g., Ethernet cables or Fibre Channel cables). The one or more high speed network switching devices 202 are further coupled to a local area network by one or more high speed networking cables 282 (e.g., Ethernet or Fibre Channel).

A plurality of microcomputers 236 couple to a gigabit switch 242 by a networking cable (e.g., Ethernet cable 257) for management and control. The gigabit switch 242 is coupled to the high speed network switching device 202 by a high speed networking cable 240 and thereby coupled to the local area network. Queries may be multicasted in parallel to each of the microcomputers 236 in the high speed intelligent network recorder through the switch 242. Alternatively, unique queries may be made to a microcomputer 236 and data stored on a hard drive that is under its control.
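A minimal sketch of fanning the same query out to every storage-node microcomputer in parallel and merging the per-node answers; the node addressing and transport below are assumptions left unspecified here:

    from concurrent.futures import ThreadPoolExecutor

    def query_node(node_address, query):
        """Send one query to one microcomputer and return its matching records.

        Placeholder: the actual transport to the node (e.g., a connection made
        through the management switch) is not specified here.
        """
        return []

    def query_all_nodes(node_addresses, query):
        """Issue the same query to all nodes in parallel and merge the results."""
        merged = []
        with ThreadPoolExecutor(max_workers=len(node_addresses)) as pool:
            for node_result in pool.map(lambda a: query_node(a, query), node_addresses):
                merged.extend(node_result)
        return merged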

Intelligent Hard Drives

Referring now to FIG. 3A, a block diagram of an intelligent hard drive 201 is illustrated that may be instantiated as the plurality of intelligent hard drives 201AA-201XY in the high speed intelligent network recorder (HSINR) 200,170,170′.

The intelligent hard drive 201 includes a microcomputer 210 coupled to a hard drive storage device 212, a magnetic hard disk drive. The magnetic hard disk drive 212 includes a magnetic disk drive controller 313A and one or more read/write heads 315 coupled together. The magnetic hard disk drive 212 further includes one or more magnetic platters 317 that are rotated by an electric motor 318. The one or more read/write heads 315 are pivoted together over the one or more magnetic platters 317 by an electric motor 319. U.S. Pat. No. 6,895,500 issued on May 17, 2005 to Michael S. Rothberg discloses further exemplary information regarding a magnetic hard drive and is incorporated herein by reference.

The mechanical motions of the magnetic disks and the read/write heads of a magnetic hard disk drive can cause the drive 212 to vibrate. With a large array of intelligent hard drives 201AA-201XY, the vibrations can be significant such that it may be desirable to dampen the vibrations to reduce the stress on the socket 211 (e.g., SATA socket) of the PCB and plug (e.g., SATA connector) of the hard drive as well as other components.

Each intelligent hard drive 201,201′ in the array may further include an elastic bumper 312 around a portion of the magnetic hard disk drive 212 to dampen vibrations from the one or more rotating magnetic platters and/or the one or more moveable read/write heads. In the network recorder, the elastic bumpers form an array of elastic bumpers around the array of magnetic hard disk drives.

The micro-computer 210 of the intelligent hard drive 201 includes a processor 301, a memory 302 coupled to the processor, a network interface adapter/controller 303 coupled to the processor, a SATA interface controller 304 coupled to the processor, and a storage device 305 coupled to the processor.

The network interface adapter/controller 303 is coupled to the high speed network switching device 202 to send and receive network flows of network data packets. The network interface adapter/controller 303 is coupled to the processor 301 to pass network flows of network data packets to it. Upon a query, the processor 301 may pass network flows of network data packets to the network interface adapter/controller 303.

The network interface adapter/controller 303 may optionally be a separate device coupled to and between each processor 301 of the plurality of intelligent hard drives 201 and the high speed network switching device 202.

Referring now to FIG. 3B, a block diagram is illustrated of an intelligent hard drive 201′ that may be instantiated as the plurality of intelligent hard drives 201AA-201XY in the high speed intelligent network recorder (HSINR) 170,170′.

The intelligent hard drive 201′ includes a microcomputer 210 coupled to a hard drive storage device 212′, a solid state storage drive (SSD) 212′. The hard drive storage devices 212 and 212′ have the same form factor so they are interchangeable. The solid state storage drive 212′ includes a solid state drive controller 313B coupled to a plug 311 and a non-volatile memory array 357 including a plurality of non-volatile memory devices 327A-327N coupled to the solid state drive controller 313B. The plug 311 (e.g., SATA connector) of the solid state storage drive (SSD) 212′ couples to the socket 211 mounted to the printed circuit board 299.

Because the solid state storage drive 212′ has no moving parts, it generates no vibrations. Accordingly, an elastic bumper around a solid state storage drive is not needed to dampen vibrations.

The storage array 202 of an HSINR 200 has a storage drive D to microcomputer M ratio (D to M ratio) greater than or equal to one. FIGS. 3A-3B illustrate a one to one ratio of hard drive 212,212′ to micro-computer 210 in each intelligent hard drive. FIG. 3C illustrates a six to one ratio of hard drives to micro-computer in an intelligent hard drive 201″. Generally, a D to one ratio (or a 2D to two ratio) of hard drives to micro-computer is used, where D is a predetermined variable. That is, 2D hard drives may be served by 2 microcomputers in each intelligent hard drive and instantiated multiple times as the plurality of intelligent hard drives 201AA-201XY in the high speed intelligent network recorder (HSINR) 170,170′.

For example, the intelligent hard drive 201″ of FIG. 3C includes a plurality of hard drives 212A-212F and a micro-computer 210 coupled in communication together. The intelligent hard drive 201″ may be instantiated multiple times as the plurality of intelligent hard drives 201AA-201XY in the high speed intelligent network recorder (HSINR) 170,170′.

Each of the plurality of hard drives 212A-212F of the intelligent hard drive 201″ may be a magnetic disk drive 212 or a solid state storage drive 212′.

High Speed Network Switching Device

FIG. 5 illustrates a block diagram of an instance of the high speed switch 202. The high speed switch 202 includes an N to N cross point switch 501, a controller 502 with a program memory 512, a plurality of network interfaces (NI) 503A-503N, 504A-504N, 505, 506, and a plurality of buffer memories 513, 514, 515, 516. The controller 502 is coupled to each of these to control the switching of packets in the switch and within the high speed intelligent network recorder (HSINR) 170,170′. Instructions stored in the program memory 512 cause the controller 502 to generate control signals and control the switch.

The network interfaces 503A-503N, 504A-504N are coupled to the plurality of intelligent hard drives 201AA-201XY in the high speed intelligent network recorder (HSINR) 170,170′. The network interface 505 couples to other networking equipment in the data center computer network 100A,100B, such as the intelligent load balancer 801 and/or tap 400,400′ shown in FIGS. 1A-1B. The network interface 506 may couple to other networking equipment in the data center computer network 100A,100B, such as the analyzers 156L shown in FIGS. 1A-1B.

In response to control signals from the controller 502, the N to N cross point switch 501 switches packets between the devices coupled to the network interfaces 503A-503N, 504A-504N. A packet, such as a command packet for example, may be multicasted to network interfaces 503A-503N, 504A-504N so that the plurality of intelligent hard drives 201AA-201XY in the high speed intelligent network recorder (HSINR) 170,170′ may act together.

The buffers 513-516 temporarily buffer packets for switching through the cross point switch 501.

Relevant Data Time Windows

Storage of the data packet communication into and out of a computer network can be useful in making a determination of what data was compromised, how the data was compromised, and who performed the attack. However, with incident detection in a computer network, stored data packet communications can become less relevant and useful over time. Accordingly, the more recent data packet communication is more desirable to store into the capacity of the array of the plurality of intelligent hard drives 201AA-201XY.

The array of the plurality of intelligent hard drives 201AA-201XY shown in FIG. 2B records a plurality of flows of network data packets, eventually filling the capacity of the aggregated plurality of the array of intelligent hard drives 201AA-201XY, without the need for data redundancy such as may be found in a redundant array of independent disks (RAID). The capacity of the array of intelligent hard drives 201AA-201XY defines a relevant data time window of network data packets (ingress and egress) associated with network flows. In contrast to backup data, the network data packets stored in the array of intelligent hard drives 201AA-201XY are readily accessible for queries and analysis during the relevant data time window, which may be performed in a distributed manner by the processors or microcomputers 210. Generally, data need not be recovered by backup software before it can be used or analyzed.
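As a back-of-the-envelope illustration (the numbers below are assumptions, not specifications), the relevant data time window is roughly the usable array capacity divided by the sustained ingest rate:

    # Illustrative relevant-data-time-window estimate: capacity / ingest rate.
    usable_capacity_bytes = 700e12    # ~700 terabytes of array capacity
    ingest_rate_bits_per_s = 10e9     # a sustained 10 Gb/s of recorded traffic

    window_seconds = usable_capacity_bytes / (ingest_rate_bits_per_s / 8)
    print(window_seconds / 86400)     # about 6.5 days before the oldest data is overwritten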

FIGS. 4A-4D illustrate drawings of relevant data time windows 400A-400D for the array of intelligent hard drives 201AA-201XY along a data time axis. The capacity of the array of intelligent hard drives 201AA-201XY can be represented by the relevant data time windows 400A-400D along the data time axis Td. The network data flows that are stored into the array of intelligent hard drives 201AA-201XY, the relevant stored data, are represented by the data flows 402A-402B within the relevant data time windows 400A-400D.

In FIG. 4A, the more recent stored network data flows 402B are the youngest data packets stored into the array of intelligent hard drives, represented by being nearest one end of a relevant data time window 400A. The earlier stored network data flows 402A are the oldest data packets stored into the array of intelligent hard drives, represented by being nearest an opposite end of the relevant data time window 400A.

In FIG. 4B, after the capacity of the array of intelligent hard drives is filled with relevant stored data the first time, the oldest stored network data flows 403A are overwritten with newer stored network data flows 403B. This is represented by the oldest stored network data flows 403A falling outside one end of the relevant data time window 400B with the newer stored network data flows 403B being included within the relevant data time window 400B near the opposite end.

In FIG. 4C, after a period of time of storing a plurality of network data flows, relevant stored data 412 in the array of intelligent hard drives is represented by network data flows within the relevant data time window 400C. Past network data flows 413 that have been overwritten are shown outside the relevant data time window 400C at one end. Future expected network data flows 415 are shown outside the relevant data time window 400C at the opposite end.

The capacity of the array of intelligent hard drives may change over time. Sectors within one or more hard drives may be mapped out from being written to. One or more sectors may have uncorrectable errors such that they are unreadable. One hard drive may completely fail so that its entire capacity may be unavailable and eliminated from the capacity of the array of intelligent hard drives.

To keep maintenance costs low and to avoid powering down the high speed intelligent network recorder, and thereby continue to record network data flows of data packets, intelligent hard drives that fail within the array may not be replaced. Furthermore, advance notice of failing sectors and hard drives may be obtained by self-monitoring, analysis and reporting technology (SMART) data for each intelligent hard drive in the array. With advance notice, new network data flows that are to be stored in the network recorder can avoid being stored into failing sectors or failing hard drives. The failing sectors or failing hard drives may still be readable. In that case, older stored network data flows may be stored in the failing sectors or failing hard drives and become more and more irrelevant as time passes.

In FIG. 4D, the capacity of the array of intelligent hard drives has decreased, such as from a failing sector or a failing hard drive. This is indicated by the relevant data time window 400D being narrower with fewer relevant network data flows 412′ being stored within the window 400D. Data flows no longer available within the array may be deemed lost network data flows 416. Lost network data flows 416, previously within the relevant data time window 400C, are now shown outside the relevant data time window 400D along with the past network data flows 413.

The capacity of the array of intelligent hard drives may initially be sized to store days', weeks', or months' worth of expected network data flows of data packets into and out of a computer network between devices. The number of intelligent hard drives in the array may number 960, for example, and store two weeks' worth of network data flows. If one hard drive is lost, only a portion of the two weeks of stored network data flows is lost, such as a few hours during a day. It may not be worth the maintenance costs and lost recording time to recover a few hours of lost data capacity. Accordingly, the failing or failed intelligent hard drives are not replaced, thereby lowering maintenance costs of the network recorder. If failed intelligent hard drives have reduced the capacity of the array of intelligent hard drives to an unacceptable level, a tray of hard drives may be replaced. Alternatively, the entire network recorder may be replaced. The hard drives that are available at a later time may be lower in cost and developed with greater density (fewer dollars per gigabyte), such that the network recorder may be larger in capacity and lower in cost when replaced.

High Speed Intelligent Network Recorder Implementation

FIGS. 6A-6D and 7A-7C illustrate implementation details for an exemplary embodiment of the high speed intelligent network recorder 200 shown in FIGS. 2A-2C and 3B-3C. FIGS. 6A-6D and 7A-7C illustrate embodiments of a high speed intelligent network recorder 170,170′, 170″ with 350 storage drives 212,212′ and 20 microcomputers 750 providing a drive D to microcomputer M (D to M) ratio of 35 to 2. The storage array 202 of storage drives 212,212′ (as well as each drive tray) may be arranged into X columns and Y rows of hard drives in order to efficiently use the space and provide the appropriate drive D to microcomputer M (D to M) ratio for the high speed intelligent network recorder. For example, when viewed from the front of a rack, the 350 storage drives in the intelligent storage array 202 may be arranged as 35 columns by 10 rows of storage drives 212,212′ mounted perpendicular to a backplane printed circuit board (PCB).

FIG. 6A illustrates a computer enclosure 614 for the high speed intelligent network recorder 170,170′ viewed from the bottom. In one embodiment, the computer enclosure 614 is a 6 U sized metal computer enclosure. The metal computer enclosure 614 is utilized for mounting computing and storage resources into two bays, a top bay 615 and a bottom bay 617. A backplane printed circuit board (backplane PCB or backplane) 612 is mounted in the enclosure 614 between the top bay 615 and the bottom bay 617. The top bay 615 of the high speed intelligent network recorder receives hard drive trays 618. The bottom bay 617 of the high speed intelligent network recorder receives controller cards 608. Each side of the backplane PCB 612 includes a plurality of sockets 606. Sockets 606 on one side of the backplane 612 receive connectors of controller cards 608. Sockets 606 on the opposite side of the backplane PCB 612 receive connectors of hard drive trays 618 (see FIGS. 6C-6D). Sockets 606 aligned with each other on opposing sides of the backplane are coupled in communication together so that controller cards 608 on one side are coupled in communication to hard drive trays 618 on the opposite side. The backplane PCB 612 provides power and ground to each controller card 608 and hard drive tray 618. The backplane PCB 612 may further provide a communication connection to each processor of each controller card 608.

In the bottom bay 617, one or more controller cards 608 plug into the one or more sockets 606 of the backplane PCB 612 on one side. When plugged in, the controller cards 608 are perpendicular with the backplane PCB 612.

In the top bay 615, one or more hard drive trays 618 (see FIG. 6B) plug into one or more sockets 606 of the backplane PCB 612 on the opposite side. When plugged in, the hard drive trays 618 are parallel with the backplane PCB 612.

In one embodiment, each of the one or more sockets 606 is a Com Express 2.1 connector mounted on the backplane 612. The controller cards 608 and the hard drive trays 618 have the same form factor and position of connectors 606 so that each may be interchangeable in the backplane 612 to provide different system configurations.

The controller card 608 includes a printed circuit board (PCB) 602 with one or more processing units 750A,750B mounted to it. The controller card 608 is mounted to the computer enclosure 614 by a partitioning frame 610.

As discussed herein, the high speed intelligent network recorder 170,170′ includes a high speed switch 202. The high speed switch 202 may be an integrated circuit chip mounted to the backplane printed circuit board 612. Alternatively, the high speed switch 202 may be a daughter card with a connector that plugs into a socket of the backplane printed circuit board 612. In another case, the high speed switch 202 may be a separate device that is mounted to the enclosure 614 or a separate device with its own 1 U enclosure that is adjacent the array 220 and coupled in communication with the intelligent hard drives by wire cables or optical cables. The cables 282 from switch 202 couple the high speed intelligent network recorder 170,170′ to the ILB or tap within the network.

FIG. 6B illustrates the computer enclosure 614 viewed from the top. In the top bay 615, a drive tray 618 is plugged into the backplane PCB 612. A plurality of hard drives 212 are plugged into the SATA sockets 619 of the drive tray 618.

Hard drive cover plates 616 over a plurality of hard drives are coupled to the computer enclosure 614. The hard drive cover plates 616 protect the hard drives 212 and provide an enclosed cavity for the cooling fans 630 (see FIG. 6C). Within the cavity, the cooling fans circulate cooling air around the hard drives 212 to maintain operating temperatures inside the computer enclosure 614.

FIG. 6C is a perspective view of the high speed intelligent network recorder with the enclosure being ghosted out to better show its assembly. In the top bay 615, one or more hard drive trays 618 are mounted to the backplane PCB 612. In the bottom bay 617, one or more controller trays 608 are mounted to the backplane PCB 612.

Each of the one or more hard drive trays 618 has a printed circuit board with a plurality of SATA sockets 619 (see FIG. 6B) in the top side. SATA connectors of a plurality of SATA hard drives 212 are plugged into the plurality of SATA sockets 619. The underside of the printed circuit board of the drive tray 618 has one or more connectors (see FIG. 7C) that plug into one or more sockets 606 of the backplane PCB 612.

FIG. 6C better shows the one or more processing units 750 mounted to the controller trays 608. FIG. 6C further shows how the controller cards 608 are plugged into sockets 606 to be perpendicular to one side of the backplane PCB 612. FIG. 6C further shows how the one or more hard drive trays 618 are plugged into sockets 606 to be parallel with the opposite side of the backplane PCB 612.

One or more circulating fans 630 are mounted to a partition frame between the one or more hard drive trays 618. One or more circulating fans 630 are also mounted to a partition frame between the one or more controller trays 608. The cooling air provided by the circulating fans 630 is contained and directed by the protective covers 616 for the hard drives. This ensures adequate airflow for internal temperature control of the computer enclosure 614 (see FIG. 6B).

In FIG. 6D a side block diagram view of an alternate configuration of a high speed intelligent network recorder 170″ is illustrated. The enclosures are ghosted out in FIG. 6D to better show the assemblies. The high speed intelligent network recorder 170″ includes a controller unit 680, a storage unit 690, and the high speed switch 202. The controller unit 680 and the storage unit 690 are coupled in communication together by the high speed switch 202. The cables 282 from the switch 202 couple the high speed intelligent network recorder 170″ to the ILB or tap within the local computer network.

In this embodiment, the computer enclosures for the controller unit 680 and the storage unit 690 are 3 U metal computer enclosures to conserve space in a server rack. A contiguous 6 U opening is often unavailable in a single server rack. The 3 U height allows the controller unit 680 and the storage unit 690 to be placed in the same server rack or different server racks where 3 U openings are often available. In alternate embodiments, the controller unit 680 may have a one rack unit (1 U) form factor while the storage unit 690 has a different form factor, such as between a two rack unit (2 U) form factor and a six rack unit (6 U) form factor.

As shown in FIG. 6D, the controller unit 680 includes one or more controller cards 608 plugged into a backplane 612A. The controller unit 680 further includes fans 630 to keep the controller cards 608 cooled.

The storage unit 690 includes one or more hard drive trays 618 plugged into a backplane 612B. The backplane in this case is not shared between the hard drive trays 618 and the controller cards 608. A plurality of hard drives 212 are plugged into the SATA sockets 619 of the drive trays 618. The drive trays 618 are in turn plugged into the backplane 612B. One or more SATA controllers 238 (see FIG. 2C) may be mounted to the drive tray printed circuit board 760 (see FIG. 7C) or the backplane PCB 612B to control the plurality of hard drives 212. One or more network interface controller (NIC) chips 234 (see FIG. 2C) may be mounted to the drive tray printed circuit board 760 (see FIG. 7C) or the backplane PCB 612B and coupled in communication with the SATA controllers 238. The NIC chips 234, through network ports (e.g., Ethernet ports), couple the storage unit 690 in communication with the local area network and the controller unit 680 through the high speed switch 202.

The storage unit 690 further includes an independent set of cooling fans 630 to provide cooling air for the plurality of hard drives in the hard drive trays.

The separation of the original 6 U computer enclosure 614 structure, with all associated components, into two separate 3 U computer enclosure units permits the user greater flexibility for the installation of the controller unit 680 and the storage unit 690 in the various racking configurations available in server storage racks. The high speed network switch 202 and high speed network cables can be coupled between the controller unit and the storage unit to couple the controller unit and the storage unit in communication together.

The high speed switch 202 may be an integrated circuit chip mounted to the backplane printed circuit board 612A,612B. Alternatively, the high speed switch 202 may be a daughter card with a connector that plugs into a socket of the backplane printed circuit board 612A,612B. In another case, the high speed switch 202 may be a separate device that is mounted to one of the enclosures or a separate device with its own 1 U enclosure that is near the controller unit 680 and the storage unit 690. Wire cables or optical cables may be used to couple the controller unit and the storage unit in communication together. The cables 282 from switch 202 couple the high speed intelligent network recorder 170″ to the ILB or tap within the network.

Referring now to FIG. 7A, a block diagram of a controller card 608 is illustrated being plugged into the backplane 612,612A. The controller card 608 includes a printed circuit board 740 with a plurality of wire traces. Connectors 758A-758B are mounted to the printed circuit board 740 and coupled to one or more of the plurality of wire traces. Connectors 758A-758B of the controller card 608 respectively plug into bottom sockets 606A-606B of the backplane 612,612A. A pair of microcomputers 750A,750B are mounted to the printed circuit board 740 of the controller card 608 and coupled to a plurality of wire traces. The printed circuit board 740 includes wire traces, including wire traces 754A,754B, to couple the microcomputers 750A,750B in communication with the connectors 758A-758B. The connectors 758A-758B are Com Express 2.1 connectors in one embodiment.

Referring now to FIGS. 7B-1 and 7B-2, a detailed block diagram of a portion (e.g., one half) of the controller card 608 is illustrated coupled in communication with a plurality of hard drives 212A-212N. A connector 758A,758B of the controller card 608 is coupled to a microcomputer 750, one or more network interface controller chips 703A-703N, and one or more SATA controllers 705A-705N on the printed circuit board. In one embodiment, the connector 758A,758B is a Com Express 2.1 Type 10 connection interface. The controller card 608 further includes two random access memory (RAM) devices 761-762, an optional flash memory device 763, and a board controller 764 coupled to the microcomputer 750. The wire traces of the printed circuit board couple the electrical elements (e.g., microcomputer 750, network interface controller chips 703A-703N, SATA controllers 705A-705N, RAM devices 761-762, flash memory device 763, and board controller 764) of the controller card 608 together.

The one or more SATA controllers 705A-705N are coupled in communication with the plurality of hard drives 212A-212N on the hard drive tray 618 that is associated with the controller card.

The one or more network interface controller chips 703A-703N couple the controller card 608 and the hard drives 212,212′ in the hard drive tray 618 in communication with devices (e.g., tap, network probe, intelligent load balancer, query agent, analyzer) in the local area network over one or more networking cables 232,257 (e.g., Ethernet cables, Fibre Channel cables).

Each microcomputer 750 on the controller card includes a plurality of processor cores (Core #1-Core #4), a memory controller, and an input/output (I/O) controller. The memory controller is coupled in communication with the RAM devices 761-762 and the flash memory device 763. The board controller 764 controls various functions (e.g., fan speed) for the controller card 608.

Referring now to FIG. 7C, a block diagram of a hard drive tray 618 is illustrated being plugged into the backplane 612,612B. The hard drive tray 618 includes a printed circuit board 760 with connectors 758C-758D and SATA sockets 619 mounted thereto. The printed circuit board 760 includes a plurality of wire traces, including wire traces 745C-745F coupled between the connectors 758C-758D and the SATA sockets 619. A plurality of SATA drives 212,212′, each with a SATA connector 311, are plugged into the SATA sockets 619 of the drive tray 618. The drive tray 618 is in turn plugged into the backplane 612,612B.

A plurality of drive trays 618 may plug into the backplane 612,612B. The backplane 612,612B includes one or more pairs of sockets 606C-606D to receive each drive tray 618. The connectors 758C-758D of each drive tray 618 plug into sockets 606C-606D of the backplane 612,612B.

One or more SATA controllers 238 (see FIG. 2C) may be mounted to the printed circuit board 760 of each drive tray 618 to control the plurality of hard drives 212,212′. Alternatively, SATA controllers 238 may be mounted to the backplane PCB 612,612B to control the plurality of hard drives 212,212′. One or more network interface controller (NIC) chips 234 (see FIG. 2C) may be mounted to the PCB 760 or the backplane PCB 612,612B and coupled in communication with the SATA controllers 238.

Intelligent Load Balancing Introduction

The data flow rate into a computer network may be tens of gigabits per second (or on the order of gigabytes per second for an entire data center). It is desirable for the network recorder to store data packets to support the data flow rate into the computer network in near real time. The write access time of one hard drive may not be able to support the desired data flow rate for a given data flow. However, if the data packets are stored in parallel into a plurality of intelligent hard drives, the desired data flow rate may be met.

A flow or network flow refers to a communication between two computer entities over a wide area computer network or local area computer network, be they servers, processes, or client and server.

The data flow into the computer network is made up of a plurality of different data flows between two devices in the local computer network, or between a device outside the local computer network and a device within the local computer network. The number of data packets for each data flow, the data flow size, can vary over a given time period. Typically, a data flow of data packets is small and can be stored into one hard drive. However, there may be a few data flows that are extremely large and cannot be efficiently stored into the same hard drive for the given data rate. Accordingly, intelligent load balancing of the storage of data packets for the plurality of data flows into the array of the plurality of intelligent hard drives is desirable.

A flow-hash is computed based on a specific “tuple” of network identifiers depending on the protocol of the flow. For example, an IPv4 TCP flow might use a tuple including source and destination IP addresses, source and destination ports, and the protocol (TCP). Note that both directions of communication for the flow will receive the same flow-hash. The intention is that a one-to-one mapping exists between a flow and its flow-hash.
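
For illustration only, a minimal sketch of a symmetric (direction-independent) flow-hash over an IPv4 TCP/UDP tuple is shown below. The tuple fields come from the paragraph above; the specific hash function, field packing, and function name are assumptions for this sketch and are not the flow-hash defined by the embodiments.

```python
import hashlib
import struct

def flow_hash(src_ip: str, dst_ip: str, src_port: int, dst_port: int, proto: int) -> int:
    """Illustrative symmetric flow-hash: both directions of a conversation
    map to the same key by ordering the (ip, port) endpoints canonically first."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    lo, hi = (a, b) if a <= b else (b, a)
    key = struct.pack("!HH", lo[1], hi[1]) + proto.to_bytes(1, "big") \
          + lo[0].encode() + hi[0].encode()
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")

# Both directions of the same TCP conversation yield the same flow-hash.
assert flow_hash("10.0.0.1", "10.0.0.2", 443, 51512, 6) == \
       flow_hash("10.0.0.2", "10.0.0.1", 51512, 443, 6)
```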

In a flow-hash system there will be the potential for collisions where two different flows or network flows have the same flow-hash. A well designed system minimizes the probability of collisions.

Flows or network flows, used interchangeably herein, refer to bidirectional network conversations (e.g., conversation flows), identified by a flow hash tag (flow-hash) unless otherwise noted. The flow-hash function should be reasonably uniform, hash both directions of the same flow to the same key, and have a minimal collision rate. Broken flows refer to flows whose packet records have been sent to more than one destination node. A node refers to the logical unit of: a network interface or equivalent ingress construct (such as a shared memory buffer), compute, a memory, and a storage device. Nodes act as endpoints for traffic from the load balancer.

A packet record (different than a netflow record) includes metadata for an incoming packet. The packet record includes, without limitation, a timestamp, a record length, and a flow hash tag. The record length field may be included to identify the size of a given record. The intelligent load balancer 900 receives records from ingress packets (typically one record per packet) and determines the destination of the record among a plurality of destination nodes. The intelligent load balancer is not necessarily limited to such records, and could be applied to any stream of elements where each element has an attached timestamp and flow hash. For example, a record may be a netflow record (different than a packet record) to which a flow hash tag is applied.
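
As a concrete illustration of the record metadata named above (timestamp, record length, flow hash tag), a minimal packet-record type might look like the following sketch; the field names and types are assumptions for illustration, not the wire format of the embodiments.

```python
from dataclasses import dataclass

@dataclass
class PacketRecord:
    """Metadata attached to one captured packet (conceptually, packet record 1002)."""
    timestamp_ns: int   # capture timestamp (e.g., nanoseconds since epoch)
    record_length: int  # stored length of the record in bytes, including overhead
    flow_hash: int      # symmetric flow-hash identifying the conversation
```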

Standard flow or hash based load balancing, where traffic is divided in a fixed manner, works well with a small number of relatively powerful nodes, assuming a reasonably uniform hash function. However, this breaks down when a large number of nodes and/or a heavily loaded link is used in comparison with the capture bandwidth capability of such nodes. The irregularity of the distribution at finer levels of detail (particularly on a shorter timescale) becomes an issue and is unworkable if there are hot flows that exceed the capability of an individual node. In many environments, the likelihood increases of very hot flows on high bandwidth links that exceed (or use a large portion of) an individual node's capacity. Hot flows typically affect only a single flow (or hash), so the benefit of the load balancing is lost. A small number of unevenly distributed hot flows can overwhelm a standard flow or hash based load balancing system.

Distributing all traffic completely randomly per-packet can produce a perfectly even distribution, but this makes a distributed flow-aware analysis extremely difficult and network resource intensive. It also means that if one storage node is lost, a large cross section of flows have at least one record lost. The alternative approach is to attempt to perform the analysis at a central point, but this quickly becomes a processing bottleneck.

The intelligent load balancer overcomes these difficulties by combining the merits of both approaches, using dynamic load balancing combined with taking advantage of the specific features of packet capture. Nodes do not need to be dimensioned to handle large hot flows. Nodes can be low cost and low power.

The intelligent load balancer uses window-based heavy-hitter (“hot” flow) detection and load balancing by means of a count-min sketch 1112 (e.g., parallel multistage filter) using a fixed number of packets, rather than a time window, for simplicity and efficiency of implementation. This is possible because an approximation of bandwidth is used to ensure the first N packets of all flows go to the same destination with very high probability for deep packet inspection. This would be more complex to achieve with a fixed time window due to boundary conditions.

More advanced dynamic load balancing algorithms exist; however, the goals are usually different given different situations. In these situations, moving hot flows is not desirable because it causes packet reordering, and in the case of traffic engineering, it increases overhead from path distribution. The impact of overloading a link or node is also less important in a non-capture situation, as TCP congestion control will essentially reduce the bandwidth of the hot flows. With an intelligent network recorder and an environment for passive network monitoring and analysis, there is no ability to control the bandwidth of flows. Moreover, there is a stronger desire to avoid breaking cold flows to enable accurate distributed analysis with minimal east-west traffic. However, once broken, hot flows can be rebalanced on a packet-by-packet basis to achieve overall bandwidth distribution close to that of random assignment (if the fraction of hot traffic is high, which is very common). There also is a strong desire that the first N packets of every flow arrive at the same destination node in order. This is so that application detection and other deep packet inspection is enabled at the end node.

An index-based distributed query, retrieval, and analysis is improved when the input traffic is load balanced using the intelligent load balancer. With the intelligent load balancer, an efficient reassembly of the stream in time order is possible, allowing a very high rate of query return.

In addition to analysis as the traffic arrives at nodes, the ordered nature of the capture at each node is used to enable replaying of a conversation through existing analysis tools, as if the traffic was live, while using low cost, low power compute units. Due to the flow coherency, intelligent load balancing per-flow analysis can be performed as if it was on a centralized system. For those few flows that are broken, minimal detection and bounded flow reassembly can be performed using the query mechanism.

Capture Processing

Referring now to FIGS. 8A-8B and 10, a portion of the data center network is shown to describe the capture processing and intelligent load balancing of packets for the intelligent network recorder.

FIG. 8A shows an intelligent network recording system 800 including a probe (intelligent load balancing) network device 801 and a high speed intelligent network recorder coupled together. To receive egress and ingress IP packets, the intelligent network recording system 800 is coupled to the tap 400. The high speed intelligent network recorder (HSINR) 170, 170′ includes a high speed switch 202 coupled to an intelligent storage array 220, or other suitable devices serving as nodes of intelligent storage. The intelligent load balancing function of the intelligent load balancing network device 801 may be hardware implemented by one or more intelligent load balancing cards 900A-900N installed into an intelligent load balancing network device 801. The intelligent load balancing function provided by the ILB cards 900A-900N may be referred to as an intelligent load balancer 900.

FIG. 8B shows an instance of an intelligent load balancing card 900. Each intelligent load balancer card 900 may include an intelligent load balancing chip 852 and a plurality of Ethernet connectors 856 mounted to a printed circuit board 860. The printed circuit board 860 includes an edge connector 862 and wire traces to couple circuits together and to the edge connector. The edge connector 862 plugs into the sockets of the motherboard in the intelligent load balancing network device 801. The Ethernet connectors 856 allow each ILB card 900A-900N to couple to the switch 202 using high speed Ethernet wire cables or Fibre Channel optical cables. Additionally, one or more Ethernet connectors 856 may be used to couple the intelligent load balancing network device 801 to the tap 400 in the data center computer network 100A,100B for IP packet capture, subsuming the function of capture cards 802C,802D.

Alternatively, the intelligent load balancing function may include software having instructions stored in memory M 851 that can be executed by one or more multiprocessors MP 850 of the intelligent load balancing network device 801. Alternatively, the intelligent load balancing function of the intelligent load balancing network device 801 may be implemented in a combination of software and hardware.

The intelligent load balancing network device 801 may include a query agent 2000. The query agent 2000 may be used to analyze the IP packets that are stored in the storage array 220 of the HSINR 170,170′.

FIG. 10 shows an overview of packet flow captured by one or more capture cards 802A-802D in the intelligent load balancing network device 801. Intelligent load balancing is desirable in a network in which multiple conversations are happening among multiple computers. Incoming packets 1000 over the monitored network are received and are captured by the one or more capture cards 802A-802D. Each capture card 802 performs some processing of the packets to form a capture stream 1010 of packet records. An incoming packet 1000 may include, for example, a TCP SYN, a TCP ACK, data, or something else. One of the capture cards 802A-802D forms a packet record 1002 in metadata for each incoming packet 1000. A packet record 1002 includes, without limitation, a timestamp 1003, a record length 1004 when stored, and a flow hash tag 1005. The capture card 802 sends the incoming packet 1000 in a capture stream 1010 to the intelligent load balancer 900. The capture stream 1010 may be stored, for example, in a large first-in-first-out (FIFO) time-ordered buffer in a multiplexer (MUX). The intelligent load balancer 900 reads the incoming packet 1000 with its packet record 1002 and applies the intelligent load balancing 900 via the intelligent load balancing network device 801.

The intelligent load balancer 900 reads metadata (1003, 1004, 1005) from the packet record 1002, determines a destination node (e.g., a particular hard drive on an intelligent hard drive) based on the flow hash 1005 of each packet record 1002, and steers the packet record 1002 to one of N encapsulation buffers 1016 (e.g., steers the packet record to buffer #0). Each flow hash 1005 is a uniquely calculated tag for each conversation flow. The number of encapsulation buffers may be, for example, N=350 or some other number. Each buffer may include, for example, 8 kilobytes of space. Each encapsulation buffer 1016 is associated with an active node (e.g., intelligent hard drive) in the intelligent network recorder 170, 170′. Each encapsulation buffer (buffer #0) may contain multiple packet records. For example, each encapsulation buffer (e.g., buffer #0) may contain one or more packet records from a single conversation, or may contain multiple packet records from multiple conversations. One capture card 802 may contain multiple encapsulation buffers 1016, but encapsulation buffers 1016 are not necessarily on the capture card 802.

When an encapsulation buffer (e.g., buffer #0) for a node becomes full, the system commits that full buffer to the transmit stream 1018. The packet records for that buffer are encapsulated into an Ethernet frame with the appropriate destination address for the associated active node. The Ethernet frame of records is sent to the ILB chip 852 or card 802A, 802B having an Ethernet media access controller (MAC) 1020. The Ethernet frames of records form a transmit stream 1018 to the ILB chip 852 or card 802A, 802B. The ILB chip 852, coupled to the connectors 856, in each ILB card 900A-900N or card 802A, 802B, and the high speed switch 202 provide a capture network fabric to each node. The transmit stream 1018 of records is transmitted over the capture network fabric to the respective node (intelligent hard drive in the network recorder) to which they are addressed. The system then uses a new buffer, or reuses a buffer, for the node that received the packet 1000.
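
The per-node buffering and flush-on-full behavior described above can be sketched as follows; the buffer size, the frame-send callback, and the class and helper names are illustrative assumptions rather than the embodiments' actual implementation.

```python
MAX_BUFFER_BYTES = 8 * 1024  # e.g., roughly one 8-kilobyte encapsulation buffer per node

class EncapsulationBuffer:
    """Accumulates packet records destined for one active node."""
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.records = []
        self.size = 0

    def append(self, record_bytes: bytes, send_frame) -> None:
        # Commit the full buffer to the transmit stream before adding more records.
        if self.size + len(record_bytes) > MAX_BUFFER_BYTES:
            send_frame(self.node_id, b"".join(self.records))  # encapsulate into one Ethernet frame
            self.records, self.size = [], 0
        self.records.append(record_bytes)
        self.size += len(record_bytes)
```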

There may be multiple storage nodes to a single compute unit, forming virtual nodes. In such a case, the node address (destination address) information includes an identifier to identify the specific virtual node at the compute unit to which the record is addressed. Generally, the intelligent load balancer 900 assumes nodes are independent, but the intelligent load balancer 900 may take into account link bandwidth and locality when making load balancing decisions to ensure reliability and increase query performance. Note that the intelligent load balancer 900 can also be applied in a non-Ethernet situation, such as a virtual environment where the capture stream and encapsulation buffers are pipes directly connected to analysis applications.

The intelligent load balancer 900 receives status messages 1030 from the nodes that are associated with a status update. A status message 1030 from a node informs the intelligent load balancer 900 of the node's capabilities, availability, and other changed events. The status messages 1030 may be stored into a node status and discovery buffer 1032 for continual use by the intelligent load balancer 900 to make informed load balancing decisions of the packet records to the nodes.

Intelligent Load Balancing Methods

FIG. 9 shows the intelligent load balancer 900 handling data packet traffic. As described with reference to FIG. 10, intelligent load balancing 900 may occur in a network in which multiple conversations are happening among multiple computers. Referring again to FIG. 9, the intelligent load balancer 900 may be implemented in software and executed by a processor in the intelligent load balancing network device 801 with the one or more capture cards 802A-802D. Alternatively, the intelligent load balancer 900 may be substantially implemented in hardware with minimal software drivers to control the hardware.

The load balancing algorithm performed by the intelligent load balancer 900 allows functionally equivalent implementations in software and hardware. Whether implemented in hardware, software, or a combination thereof, the load balancing algorithm attempts to maintain flow coherency, and attempts to ensure the first N packets of a flow are always sent to a single destination to allow deep packet inspection (DPI) such as application detection. In one embodiment, flow coherency includes records from a bidirectional network flow or flow being sent to the same destination (e.g., a cold flow tends to be directed coherently to a single node). The load balancing algorithm supports load balancing to very limited capability nodes while maintaining flow coherency.

FIG. 11 illustrates details of the methods of intelligent load balancing by the intelligent load balancer 900. The packet processing subsystem 1101 of FIG. 11 is basically the overall process of FIG. 10. The hot detection and balancing processes 1102 and the cold balancing processes 1103 of FIG. 11 are basically the processes of the intelligent load balancer 900 of FIG. 10.

At the packet processing subsystem 1101 of FIG. 11, additional metadata is attached to each incoming Ethernet frame. The additional metadata includes, without limitation, a timestamp 1003, a record length 1004, and a flow hash tag 1005, which form a packet record 1002. The capture stream 1010 includes a plurality of packet records 1002. When the record 1002 is read from the capture stream 1010, the timestamp 1003, record length 1004, and flow hash 1005 are passed to the hot detection and balancing subsystem 1102.

The hot detection and balancing subsystem 1102 determines whether a packet record should be considered part of a “hot” flow (and be randomly distributed) in this subsystem, or considered to be part of a “cold” flow (and remain flow coherent) and further processed by the cold balancing subsystem. In one embodiment, a “hot” flow is a conversation having a bandwidth that is greater than a given threshold, while a “cold” flow is a conversation having a bandwidth that is less than or equal to the given threshold. In one embodiment, a capture bandwidth of a destination node is significantly less than (e.g., less than one-tenth) a total incoming bandwidth of the plurality of conversation flows; in one embodiment, a capture bandwidth of a destination node is significantly less than an admissible bandwidth of a single hot flow. Accordingly, at the hot detection and balancing 1102, the system inserts the record length 1004, at positions computed from the flow hash 1005, into the count-min sketch 1112. The count-min sketch 1112 provides an estimate of the total length of records represented (e.g., flow queue bytes) in the record summary queue 1114 for that flow.

The system forms a record summary that includes the record timestamp 1003, the record length 1004, and the flow hash 1005. The system adds that record summary of the current packet record to the record summary queue 1114. Each record summary may also include other flags for tracking additional statistics, such as a flag for whether the record was considered part of a hot flow and the destination that the flow was assigned to. When the load balancer is used for packet capture storage, the length includes record overhead, as the entire record is stored at the destination.

Statistics are kept on the record summary queue 1114, including bytes of records in the queue, bytes of records that were marked as hot, and the current queue time period. These (or equivalent state tracking) may be advantageously used by the embodiments. The record summary queue 1114 has one entry of a fixed length (e.g., fixed bytes) for each packet record 1002. The queue 1114 may have a fixed number of records, but that is not a requirement. Accordingly, each entry in the record summary queue 1114 effectively represents a period of time. The time periods enable the system to monitor the relative load of the different capture streams 1010 and different nodes. The size of the record summary queue 1114 represents the time period over which the system can measure the bandwidth for the network flow for each conversation. When the number of flows or traffic bandwidth is very high, it is impractical to track an entire conversation over the entire lifetime of the conversation. Each entry is not a fixed unit of time by default, but each entry represents the time window between the packet and the next entry. Accordingly, the record summary queue 1114 provides a window for measuring approximate bandwidth for a transmission of each conversation. Accuracy of the measurement of the bandwidth tends to increase as the size of the record summary queue 1114 increases.

When the record summary enters the record summary queue 1114, the record length is inserted into a count-min sketch 1112 (or other space efficient table). A family of pairwise independent hash functions is used to generate the hashes for the sketch table, using the original record flow hash as the key. The flow hash may be provided to the load balancer, and if not, the flow hash may be computed by the load balancer.

The count-min sketch 1112 is similar in concept to a counting bloom filter. The count-min sketch 1112 provides an accurate value estimate of entries above a threshold. The count-min sketch 1112 is known to have a low probability of false positives, and its size is not proportional to the size of the input.

The simplest count-min sketch 1112 variant includes a group of several pair-wise independent universal hashes which map to the same number of rows d of width w in a table. The value is inserted into each row of the table at the position computed by the hash function for that row. The value is estimated as the minimum of the values at the hash position of each row. Other variants exist and may be used to trade off computational complexity, accuracy, and the introduction of false negatives. In particular, the intelligent load balancer may use the Count-Mean-Min variant of the sketch that uses the mean of the row, excluding the current counter, as a noise estimate and returns the median of the counters as an unbiased estimate, finally returning the minimum of this and the minimum to reduce overestimation.

As the circular queue is fixed length for simplicity, any variant of the sketch should allow deletions. In particular, the minimal increase optimization is not used.

In order to represent the flow queue bytes, the record length 1004 of the oldest record summary in the record summary queue 1114 is decremented (e.g., removed) from the counters in the count-min sketch 1112. Once the system decrements (e.g., removes) a record summary from the record summary queue 1114, the system forgets about that record summary.

The count-min sketch 1112 decrement should occur before the insertion of the record summary at the record summary queue 1114. That way, the record summary does not affect the accuracy of the estimated flow queue bytes. The count-min sketch 1112 thereby keeps an approximate count of the number of bytes per flow hash. Accordingly, the count-min sketch 1112 is an approximate hash table or similar. The lookup (and hence the counters) of the count-min sketch 1112 is approximate.
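
The window-based byte accounting just described (decrement the oldest summary, then insert the new one, then estimate the flow's queued bytes) can be sketched as below. The sketch dimensions, the hashing scheme, and the class and function names are illustrative assumptions, not the hardware implementation of count-min sketch 1112 or record summary queue 1114.

```python
import collections
import random

class CountMinSketch:
    """Approximate per-flow byte counters (conceptually like count-min sketch 1112)."""
    def __init__(self, depth: int = 4, width: int = 4096, seed: int = 1):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _positions(self, flow_hash: int):
        # One position per row, derived from the flow hash and that row's salt.
        return [((flow_hash ^ salt) * 0x9E3779B97F4A7C15 >> 13) % self.width
                for salt in self.salts]

    def add(self, flow_hash: int, nbytes: int) -> None:
        for row, pos in zip(self.rows, self._positions(flow_hash)):
            row[pos] += nbytes

    def estimate(self, flow_hash: int) -> int:
        return min(row[pos] for row, pos in zip(self.rows, self._positions(flow_hash)))

QUEUE_LEN = 100_000                      # window length N in records
queue = collections.deque()              # record summary queue: (timestamp, length, flow_hash)
sketch = CountMinSketch()

def flow_queue_bytes(timestamp: int, length: int, flow_hash: int) -> int:
    """Decrement the oldest summary, insert the new one, return the flow's byte estimate."""
    if len(queue) == QUEUE_LEN:
        _, old_len, old_hash = queue.popleft()
        sketch.add(old_hash, -old_len)   # remove the oldest record from the counters first
    queue.append((timestamp, length, flow_hash))
    sketch.add(flow_hash, length)
    return sketch.estimate(flow_hash)
```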

At the hot threshold 1116, the system can use the approximate bandwidth, the estimated byte threshold, or both, in order to determine if the flow for a packet is hot (relatively high amount of traffic) or cold (relatively low amount of traffic). Once classified as hot or cold, a record is treated differently depending on whether it is part of a hot flow or not.

If the system determines the flow is cold, then the system uses the cold balancing algorithm to assign the packet to the appropriate encapsulation buffer 1016. If a bandwidth threshold is used instead of a byte threshold, then the system approximates the bandwidth over the length of the queue (or otherwise initially underestimates bandwidth) in order to maintain a very high probability that the first N packets of the flow are sent to the same destination. With very short queues, a simple byte threshold is used instead, or optionally, a simple byte threshold in combination with a bandwidth threshold can be used to approximate the bandwidth. The system attempts, as much as possible, to assign cold flows evenly without breaking up the flows. Cold flows (records not marked as hot) are load balanced evenly by the flow hash tag, with each record having the same flow hash being sent to the same destination node.

Optionally, a small lookup or hash table may be used to differentiate flows once they reach the count-min sketch 1112 threshold level to greatly reduce false positives and flow hash collisions (as described with reference to FIG. 1C, a flow is identified by the end points which are communicating via the flow). This is to assure that at least the first N packets of a flow go to a single destination node with very high probability. However, it is not fully guaranteed that the first N packets in a flow will go to a single destination node, due to the cold rebalancing mechanism or node fail-over.

However, if the flow queue bytes value estimation returned by the count-min sketch 1112 (after insertion of the current record length), expressed as an approximate bandwidth (using the queue record timestamp period), is greater than the hot threshold, then the record is considered “hot”. The hot threshold is set such that no single flow will exceed a significant fraction of the bandwidth capability of the node. For a hot flow, the system pseudo-randomly assigns the packet to the appropriate encapsulation buffer 1016 based on the load of the nodes. Accordingly, a hot flow may be spread across multiple encapsulation buffers 1016 for multiple nodes. Hot traffic generally makes up the majority of bandwidth but in a small number of flows on networks. Thus, the intelligent load balancer sends the vast majority of flows (e.g., cold flows) each to a single node, despite the low capability of the node.

A feature of the system is a delay before a flow is detected as hot: the detection threshold is set such that at least the first N packets of the flow go to the same destination to allow deep packet inspection, such as application detection, even if the flow is later determined to be hot with its packets distributed to a plurality of nodes.

Cold record bytes are tracked in per-node rate counters 1120 (see FIGS. 15A-15C), which inform the weighting of the hot balancing (see FIG. 14) and the triggering and weighting of cold balancing (see FIGS. 15A-15C, 16). Additional statistics may be kept but are not strictly necessary for the operation of the system. The system updates the per-node cold byte rate counters 1120 by using the record length 1004 and the cold node selection. The counters 1120 provide an estimate of how many packets are going to a particular node. So, if a node is relatively busy, the system should not send more packets to that busy node. The system uses the per-node byte rate counters 1120 and the node status and discovery 1152 for cold rebalancing 1154 in the cold balancing processes 1103. The system is informed about the node status and discovery 1152 via the node status messages 1150. The node status messages 1150 can be received in a couple of different ways. One way is by each node periodically sending a node advertisement including static information (e.g., node exists, node is operational, node is not operational, node storage capacity, performance claim about maximum bandwidth, how full the node's receive buffer is, etc.). Another way is by each node periodically sending a node advertisement including dynamic information (e.g., the buffer is almost full, etc.). If a node is full, or almost full, the load balancer 900 reassigns the buffer for that node to another encapsulation buffer 1016 for another node. If a node stops working, then the node stops sending status messages. Such a stop on advertisements informs the intelligent load balancer, via a timeout period, that the node is unavailable and cannot receive new traffic. The load balancer 900 reassigns the buffer for that inactive node to another encapsulation buffer 1016 for another node.

Simultaneously (or after the record is determined cold in a sequential implementation), the record flow hash is looked up in the cold node assignment lookup table 1156. The cold node assignment lookup table 1156 is updated infrequently by the cold rebalancing 1154. The flow hash 1005 is passed through a function (such as modulus) which uniformly maps the space of the flow hash to a smaller set of bins, each bin being associated with a node number. The number of bins is at least as many as the number of active destination nodes. Each bin is assigned to or associated with a node. The lookup table 1156 assigns the packet to the appropriate encapsulation buffer 1016 when the flow is determined to be cold. The contents of the lookup table 1156 may change, for example, if one of the nodes goes down (e.g., a hard disk stops working). In such a case, the cold rebalancing 1154 performs a reassignment of the packet to another encapsulation buffer 1016. The contents of the lookup table 1156 are intended to only change when necessary to avoid overloading nodes or sending to an unavailable node, to avoid unnecessarily breaking flows. The output of this lookup table 1156 is the node assigned to that flow, assuming it were a cold flow (see FIG. 10 and its description). In a simple embodiment, the lookup table 1156 may be replaced with a simple function that evenly maps the flow hash space to the set of active nodes.

Accordingly, cold flows are assigned based on dynamic coarse hash-based binning by the cold node assignment lookup table 1156, similar to the hash lookup table in U.S. patent application Ser. No. 14/459,748, incorporated herein by reference. It is intended that this binning of cold flows is adjusted as little as possible to avoid breaking flows. To this end, there is a further mechanism where individual bins are redistributed away from a node only when it hits a threshold close to its maximum capacity. Until this point, hot redistribution maintains an even load by reducing hot traffic sent to the node.

Further, cold rebalancing may also be necessary with a dynamic number of nodes and should minimize the number of bin movements to minimize flow breakage. When a node becomes unavailable, the cold balancer reassigns all of the bins previously assigned to the departing node in the cold node assignment lookup table 1156. Infrequent rebalancing for a more even cold load based on binning and/or node statistics is also possible. However, this is likely unnecessary given the low total bandwidth at end nodes.

A dynamic hot threshold, rehashing, or dynamic bin sizing could be used if a single bin exceeds the capacity of an individual node. However, this situation is expected to be extremely rare with a uniformly distributed input hash function and a well-chosen hot threshold.

The system informs the hot balancer 1118 by using the node status and discovery 1152 and the per-node cold byte rate counters 1120. Using such information, the system tends to assign a packet of a hot flow to an encapsulation buffer 1016 that is relatively less busy.

Accordingly, in response to the hot detection and balancing processes 1102 and the cold balancing processes 1103, the system steers the packet record 1002 to one of a plurality of encapsulation buffers 1016 associated with the destination nodes. So, the selection of the destination node to which the record is sent is a function of the intelligent load balancer 900 (e.g., hot detection and balancing 1102 and cold balancing 1103).

Once assigned a node, packet records are accumulated into one of a plurality of encapsulation buffers 1016, where the plurality is the same as the number of active destination nodes. When the buffer (e.g., buffer #0) reaches a maximum size, possibly determined by the maximum supported frame size of the Ethernet network (MTU), the content of the buffer is sent to the destination node through encapsulation into an Ethernet frame. The compute units attached to the internal switch fabric each have a number of attached nodes, but this detail is largely abstracted from load balancing decisions.

At decision operation 1106, the system determines if an encapsulation buffer 1016 is substantially full or not. When an encapsulation buffer 1016 becomes substantially full, the records associated with a destination node are sent via an Ethernet link and high-speed switch to the destination node (see destination node 950 in FIG. 9, for example). The system sends the contents of the full encapsulation buffer (e.g., buffer #0) to the Ethernet MAC 1020 (coupled to a high-speed switch) via the transmit stream 1018. The capture stream 1010 may be stored, for example, in a large first-in-first-out (FIFO) time-ordered buffer in a demultiplexer (demux), where the selector is the node select signal. The transmit stream 1018 is, for example, a FIFO buffer that may contain multiple Ethernet frames destined for nodes.

In one embodiment, the intelligent load balancer 900 is a software process running on a network capture device. The network capture device includes one or more data acquisition and generation (DAG) cards with Ethernet ingress capture ports and Ethernet egress ports. The ingress capture ports are used to capture the network packets flowing into and out of the local area network or data center. The egress ports are connected via a high-speed Ethernet switch to a plurality of nodes to store the network packets. The high-speed Ethernet switch and the plurality of nodes may be referred to as the internal switch fabric.

The core algorithm is expressly designed to be efficiently implementable in both software and hardware (FPGA). A count-min sketch is used to allow a high-performance implementation in high-speed SRAM, and the multiple hashes may be computed in parallel. The queue may use slower DRAM. However, only the linearly accessed insertion point (which is also the removal point) needs accessing per record, so it could be pre-fetched. A periodic rebalancing operation may be implemented in software. The cold-node assignment lookup table may be implemented by adjusting the hardware hash tag load balancing table described in U.S. patent application Ser. No. 14/459,748 (using the destination node or encapsulation buffer index as the queue index).

Aspects described as “even” may also be weighted based on the differing nominal bandwidth and compute capability of individual nodes. Also, the pseudo-random assignment may be deterministic instead (e.g., choose the node with minimum traffic). Also note, different policies can be applied to hot traffic, such as not balancing the hot flow unless the node is already overloaded (e.g., exceeding the maximum advertised rate 1211).

Nodes may be dynamically discovered and managed, advertising their maximum capture rate and capacity. They may also advise the intelligent load balancer of their buffer level and notify when their buffer reaches a critical level. The communication between nodes and the ILB can occur via a multicast node advertisement system. The load balancer's node status and discovery subsystem may use this information in its load balancing decisions (such as in weighting the ‘even’ load). The load balancer's node status and discovery subsystem may check a destination node's status prior to sending an encapsulated record buffer, instead sending the buffer to a different node where possible if the first node has become unavailable. The load balancer's node status and discovery subsystem may exclude slow or unavailable nodes from further traffic distribution as well. This minimizes packet drop due to node storage or compute failure, which is more likely when using low cost hard drives.

In a software implementation, multiple instances of the intelligent load balancer may be executed by the probe/tap so that even higher data rates can be achieved. Hash-tag load balancing may be used to pre-distribute load amongst the multiple instances of the intelligent load balancer. The multiple instances of the intelligent load balancer may communicate, possibly via a central node manager process or system, and share state about bandwidth and load, in order to maintain the correct distribution and to ensure each node in a pool of nodes is assigned to at most one intelligent load balancer to ensure timestamp ordering at the capture end.

Referring now to FIG. 21, an algorithm for determining a minimum approximate bandwidth threshold B_(thresh) (also referred to as a hot approximate bandwidth threshold) is shown. The algorithm relates the hot approximate bandwidth threshold B_(thresh) and the minimum number of first packets to send to a node for the packet threshold n_(thresh). S_(min) is the minimum record length while S_(max) is the maximum record length. A queue 2100 has a queue length N with a plurality of packets or records of minimum record length S_(min). The equations and variable meanings shown in FIG. 21 are incorporated herein by reference.

The hot approximate flow bandwidth threshold B_(thresh) is compared against the time period of the queue and the flow byte estimate. This threshold determines whether a given flow is to be treated as “hot” for the current record. If a bandwidth threshold is used without a byte threshold, then the bandwidth is approximated over the length of the queue N (or otherwise initially underestimated) in order to maintain a very high probability that the first N packets of the flow are sent to the same destination. If the queue length is short, a simple byte threshold may suffice as the hot approximate flow bandwidth threshold. Optionally, a simple byte threshold may be used in combination with a bandwidth threshold to determine the hot approximate flow bandwidth threshold B_(thresh).
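
To make the comparison concrete, a hedged sketch of the hot test follows: the flow's queued bytes (the count-min sketch estimate) divided by the time period spanned by the record summary queue gives an approximate bandwidth, which is compared against B_(thresh). The function and variable names are illustrative assumptions; the precise relationship among B_(thresh), n_(thresh), S_(min), and S_(max) is defined by the equations of FIG. 21 and is not reproduced here.

```python
def is_hot(flow_queue_bytes: int, queue_oldest_ts_ns: int, queue_newest_ts_ns: int,
           b_thresh_bytes_per_s: float) -> bool:
    """Approximate the flow's bandwidth over the queue window and compare it to B_thresh."""
    window_s = max((queue_newest_ts_ns - queue_oldest_ts_ns) / 1e9, 1e-9)
    approx_bandwidth = flow_queue_bytes / window_s   # bytes per second over the queue window
    return approx_bandwidth > b_thresh_bytes_per_s
```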

Cold Node Assignment

Referring now to FIG. 13, the process of cold node assignment by the cold node assignment lookup table 1156 is now described. The flow hash 1005 of the incoming record is passed through a mapping function f(n) that maps the packets of the flow hash consistently to a smaller number of bins (e.g., 1024 bins in FIG. 13).

In the example of FIG. 13, the mapping function f(n) is a simple bit mask. Each bin in the cold node assignment lookup table 1156 is assigned a node. The cold rebalancing subsystem 1154 updates the mapping in the table 1156 when the number, or availability, of nodes changes. The cold rebalancing subsystem 1154 further handles cold bin movement, and may optionally perform arbitrary major rebalances for long term evenness. However, the arbitrary major rebalances should be performed infrequently by the cold rebalancing subsystem 1154 to minimize broken flows. In a combined hardware-software implementation, the cold rebalancing subsystem 1154 would generally be in software, and the cold node assignment lookup table would be stored and updated with hardware.
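
A minimal sketch of the bin-based cold assignment follows: the flow hash is masked down to a bin index, and the bin's currently assigned node is returned. The bin count, the round-robin initialization, and the function names are illustrative assumptions consistent with the 1024-bin example above, not the hardware table of the embodiments.

```python
NUM_BINS = 1024  # e.g., the 1024-bin example of FIG. 13; a power of two so a bit mask works

def make_cold_table(active_nodes: list[int]) -> list[int]:
    """Initial mapping: bins spread evenly (round-robin) across the active nodes."""
    return [active_nodes[b % len(active_nodes)] for b in range(NUM_BINS)]

def cold_node_for(flow_hash: int, cold_table: list[int]) -> int:
    """Map the flow hash to a bin with a simple bit mask, then look up the assigned node."""
    return cold_table[flow_hash & (NUM_BINS - 1)]

# Example: six active nodes A through F represented as 0..5.
table = make_cold_table(list(range(6)))
node = cold_node_for(0xDEADBEEFCAFEBABE, table)
```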

Traffic Balancing

Referring now to FIGS. 12A-12D, bandwidth charts are shown for a plurality of nodes to discuss traffic balancing by the intelligent load balancer 900. In each bandwidth chart, the bandwidth of the incoming stream of records for storage is divided into hot traffic and cold traffic across an arbitrary number N of nodes (six nodes A through F being depicted in the figures). A fair share bandwidth level 1210 and a node advertised maximum bandwidth rate level 1211 are depicted by a pair of lines in the chart, with each node having the same levels except in FIG. 12D. In FIG. 12D, the fair share bandwidth level 1210C and the node advertised max rate level 1211C for node C differ from the fair share bandwidth level 1210 and the node advertised max rate level 1211 for the other nodes. FIG. 12C illustrates a bandwidth threshold for each node by a line in the chart.

FIG. 12A illustrates a cross section of node bandwidth share (hot and cold traffic) for a plurality of nodes in a typical operation of the intelligent load balancing algorithm. The cold traffic may be assigned through a coarse hash-tag load balancing using the cold node assignment lookup table 1156. But for node D's cold traffic 1201D, the cold traffic 1201A-C,1201E-F is relatively well balanced amongst the nodes A-C and E-F, but not exactly even. Cold traffic generally makes up a minority of the record bandwidth but the majority of flows on networks.

After hot detection by the hot detection and balancing 1102, the hot traffic undergoes hot traffic balancing by a hot balancer 1118. The hot balancer 1118 weights hot bandwidth such that each node receives an even share of the total bandwidth, referred to as a fair share 1210. Node D has a large exaggerated cold bandwidth share 1201D and a small hot bandwidth share 1202D, but a total share that is substantially equal to the fair share bandwidth level 1210. This illustrates that random weighted hot balancing causes the percentage of hot traffic allocated to a node to approach zero as the cold bandwidth for that node approaches the fair share amount.

In FIG. 12B, node D's cold bandwidth 1211D has exceeded the fair share 1210 of total bandwidth. In this case, node D receives no hot traffic at all. A rebalancing does not occur with the cold rebalancer 1154 at this point because it would unnecessarily break flows. However, it may be necessary to do a wider rebalancing eventually, if long term storage retention uniformity is desirable.

In FIG. 12C, the cold bandwidth assigned to node D rose beyond a threshold 1252 above the node's advertised maximum storage rate 1211. At this point the cold balancer 1154 selects a single bin to reassign cold traffic to a different node, such as node E. Typically, the cold traffic 1223T that is to be reassigned should be reassigned to the bin (and thereby assigned to that associated node) with the most hot traffic (and thus lowest cold traffic). However, if bin bandwidth statistics are not available, selecting the bin associated with the record at the time of trigger will tend towards bins with higher bandwidth. The cold traffic 1223T is reassigned to the node with the lowest cold flow bandwidth, in this case node E. This only breaks flows that hash to that particular bin associated with node D. With a suitably well-distributed flow hashing function, it is likely that a small number of high-bandwidth bins containing not-quite-hot flows cause the cold bandwidth imbalance. This is because the flow distribution by the flow hashing function should distribute traffic among bins in a relatively uniform manner. This means it is a reasonable approach to move only a single bin of cold traffic 1223T, as shown, from node D to node E.

If storage retention uniformity is desired, the threshold 1252 could be set below or at the fair share 1210 rather than above the advertised maximum rate 1211. Multiple bins could also be moved to resolve the cold traffic flow imbalance. Note that a similar number of network flows are broken when moving multiple bins individually or in a group.

FIG. 12D illustrates an example of how nodes with differing capability may be included in the weighting of the hot and cold balance algorithms, in order to distribute traffic according to node capabilities. In FIG. 12D, node C has a lower maximum advertised storage rate 1211C than the higher maximum advertised storage rate 1211 for the other nodes. In this case, the cold bandwidth share 1231C in the cold balancing bin associated with node C is lower. Moreover, the fair share 1210C of total bandwidth for node C is weighted lower than the fair share 1210 for the other nodes. Due to the lower fair share 1210C, the hot balancing algorithm may allocate a lower percentage of hot traffic to node C, as indicated by the difference between the hot bandwidth share 1222C shown in FIG. 12C and the hot bandwidth share 1232C shown in FIG. 12D.

Hot Balancing

FIG. 14A illustrates a conventional process of hot balancing by using a hot balancing weighting algorithm. Cold byte rate counters shown in FIG. 14B, for example, are used to determine the hot traffic weighting. The difference between a node's current cold bandwidth and the total bandwidth fair share is used to weight the node selection.

As shown in FIG. 14A, for an incoming packet record that is a part of hot traffic 1402, at process 1404, the system chooses a random available node n (e.g., by using the hot balancer 1118).

At process 1406, the system then chooses a number m (which is random or deterministic) between 0 and a node fair share. The number m may be chosen once per comparison or may be chosen once per packet.

At process 1408, the system determines if the number m is greater than a node cold bytes threshold. If the number m is greater, then at process 1410 the system steers the hot packet record to the node n buffer. However, if the number m is not greater than the node cold bytes threshold, then at process 1412, the system determines if the maximum number of retries has been reached for that packet record.

If the maximum number of retries has been reached, then the system goes to process 1410 and steers the hot packet record to the node n buffer. However, if the maximum number of retries has not been reached, then the system goes back to process 1404 and chooses another random available node n for the packet record.
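
The FIG. 14A loop (pick a random available node, draw m between 0 and the node's fair share, accept the node if m exceeds its cold byte level, otherwise retry up to a limit) might be sketched as below; the data shapes, the retry limit, and the function name are illustrative assumptions.

```python
import random

def choose_hot_node(available_nodes: list[int], cold_bytes: dict[int, float],
                    fair_share: dict[int, float], max_retries: int = 8) -> int:
    """Weighted-random selection of a destination node for a hot packet record (per FIG. 14A)."""
    node = random.choice(available_nodes)              # process 1404: pick a random available node n
    for _ in range(max_retries):
        m = random.uniform(0.0, fair_share[node])      # process 1406: m between 0 and node fair share
        if m > cold_bytes[node]:                       # process 1408: node's cold load is low enough
            return node                                # process 1410: steer record to node n buffer
        node = random.choice(available_nodes)          # process 1412 -> 1404: retry with another node
    return node                                        # retry limit reached: accept the last node
```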

FIG. 14B illustrates hot balancing by using cold byte rate counters or leaky bucket counters. The purpose of FIG. 14B is similar to the purpose of FIG. 14A. However, FIG. 14B shows “pressure based” buckets for nodes that the system “drains” at a constant rate. In FIG. 14B, the shaded area represents the bandwidth at the nodes (e.g., nodes 1-5) for cold records. The system uses a bucket level in a similar way as the system uses the number m for decision making in FIG. 14A.

Where all nodes have equal capability, fair share is equal to the current total bandwidth divided by the number of active nodes. Note that fair share is not a static value. The fair share value fluctuates with the current bandwidth of packets that are being stored. A specific algorithm flow is described, but any weighted random algorithm can be used; most, however, require additional state. A limited number of retries is used to ensure the algorithm terminates in a timely manner, especially with unusually high cold traffic with a large number of nodes, as perfect weighting is not necessary.

Cold Bin Movement Rebalancing

Referring now to FIGS. 15A and 15B, the process of cold bin movement rebalancing is now described. FIGS. 15A and 15B describe FIG. 12C in more detail.

FIGS. 15A and 15B illustrate the process of cold bin movement. A cold node check 1501 is triggered (such as periodically, or per cold record). Then a determination 1502 is made whether the cold bandwidth of a node exceeds the node's advertised maximum capture rate (or a threshold). If the cold bandwidth of the node exceeds the threshold (yes), a bin is chosen to reassign cold traffic to another node. At process 1503, a search for a node with a minimum cold bandwidth share amongst the other available nodes is made. At process 1504, the node with the minimum cold bandwidth share is chosen to receive the reassigned cold traffic. Then at process 1506, the bin in the cold node assignment lookup table is rewritten by the cold balancer to be the recipient node. FIG. 15B illustrates bin 3 being reassigned from node 2 to node 4, for example. In this manner, the cold bandwidth of the overloaded node (e.g., node 2) is reduced while breaking only one bin's worth of flows.

Referring to FIGS. 15C and 15D, two mechanisms of cold byte rate counters for tracking cold bandwidth are described. Other mechanisms can also be used as a cold byte rate counter.

In FIG. 15C, a software implementation includes an extra “hot” bit 1510 in the record summary queue 1520. The record summary queue 1520 also includes the timestamp, the flow hash, and the record length. The hot bit 1510 is used to add and subtract from a per-node cold bandwidth counter (e.g., record summary queue 1114 of FIG. 11). As described with reference to FIG. 11, relative bandwidth can be estimated by using the queue bytes. Alternatively, absolute bandwidth can be estimated by using the queue size (e.g., queue bytes) and the queue time period. The system chooses the node with the lowest ratio of node cold bytes counter/(queue bytes·node max rate weight). This selection is an example of operation 1503 in FIG. 15B.

FIG. 15D illustrates a leaky bucket counter that can be used for each node. Cold record bytes 1515 for a node are added into a bucket 1530. The bucket 1530 is drained of cold record bytes 1516 over time such that the level of the bucket represents node cold bandwidth (the cold bandwidth of a node). The drain rate, a fixed percentage per unit of time, is proportional to the level 1532 of cold record bytes 1517 remaining in the bucket 1530.
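
A hedged sketch of such a per-node leaky bucket counter is given below: cold record bytes are added as they arrive, and a fixed fraction of the remaining level drains per unit of time, so the steady-state level tracks the node's cold bandwidth. The drain fraction and the class and method names are illustrative assumptions.

```python
class LeakyBucketCounter:
    """Per-node cold byte rate counter (conceptually like FIG. 15D)."""
    def __init__(self, drain_fraction_per_s: float = 0.5):
        self.level = 0.0                      # cold record bytes currently in the bucket
        self.drain_fraction_per_s = drain_fraction_per_s
        self.last_ts_s = 0.0

    def add_cold_bytes(self, nbytes: int, now_s: float) -> None:
        self._drain(now_s)
        self.level += nbytes                  # cold record bytes added into the bucket

    def cold_bandwidth_level(self, now_s: float) -> float:
        self._drain(now_s)
        return self.level                     # the level approximates node cold bandwidth

    def _drain(self, now_s: float) -> None:
        dt = max(now_s - self.last_ts_s, 0.0)
        # Drain a fixed fraction of the remaining level per unit of time (proportional drain).
        self.level *= (1.0 - self.drain_fraction_per_s) ** dt
        self.last_ts_s = now_s
```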

FIG. 16 illustrates cold bin movement rebalancing. FIG. 16 provides additional detail for FIG. 12C. When a node's cold bandwidth (as determined by the per-node cold byte rate counters) exceeds a threshold (usually just above the maximum rate), the node is sufficiently unbalanced to require a bin rebalance. In FIG. 16, the bin rebalance is triggered when the record that pushes the cold bandwidth over the threshold arrives, and assumes some buffering delay to limit the churn rate of very high bandwidth bins. Approximating bandwidth using a long queue also helps with this. Alternative methods of introducing stability, such as requiring the node cold bandwidth to exceed its bandwidth for a period of time or delaying the rebalance trigger, can be used. When a bin rebalance occurs, a bin from the overloaded node is moved by simply reassigning the bin in the cold node assignment lookup table to the node with the least bandwidth (e.g., least cold traffic). In this case, the system moves the bin from node 2 (the overloaded node) to node 4 (the node with the least cold traffic).
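
A minimal sketch of that bin move follows, building on the cold table and leaky bucket sketches above; the way the bin to move is chosen (here, simply the first bin found) and the helper names are illustrative assumptions.

```python
def rebalance_overloaded_node(cold_table: list[int], overloaded_node: int,
                              cold_bw: dict[int, float], available_nodes: list[int]) -> None:
    """Move one bin off an overloaded node to the node with the least cold bandwidth (FIG. 16)."""
    recipient = min((n for n in available_nodes if n != overloaded_node),
                    key=lambda n: cold_bw[n])
    # Rewrite one bin currently assigned to the overloaded node in the lookup table.
    # With per-bin statistics, the bin carrying the most traffic would be a better choice.
    for bin_index, node in enumerate(cold_table):
        if node == overloaded_node:
            cold_table[bin_index] = recipient
            return
```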

Node Availability Change

Referring now to FIG. 17, the process of node availability change is now described. FIG. 17 pertains to moving all of the traffic off of an unavailable node, while FIG. 16 above pertains to moving some traffic from an overloaded node.

In FIG. 17, node availability change determines how to load balance traffic when a node becomes unavailable. Consider for example node 2. The node 1700 issues a message 1702 that indicates its capture buffer is almost full because it is writing too slowly into storage, for example. The message 1702 is received by the node status and discovery subsystem 1152. The node status and discovery subsystem 1152 first tries to select a replacement idle node. In this example, no replacement idle node is available, so the node status and discovery subsystem 1152 fails to find one. The cold rebalancing 1154 is informed of the failure to find a replacement idle node.

When this occurs, the cold balancer 1154 instead reassigns all of the bins previously assigned to the departing node (e.g., node 2) in the cold node assignment lookup table 1156 to other nodes. FIG. 17 illustrates assigning bins 1702A-1702C associated with node 2 in table 1156A respectively to bins 1704,1706,1703 associated with nodes 4,6,3. This can be done without breaking other flows by assigning the orphaned bins to the nodes with the lowest cold bandwidth, similar to the cold bin movement algorithm. A wider rebalancing over more or all nodes could also be done, if this is a rare event. However, a wider rebalancing will break more cold flows.

Node Status and Discovery

Referring now to FIGS. 18A-18B, the process of the node status and discovery subsystem 1152 of FIG. 11 is further described. The node status and discovery subsystem 1152 may be fully implemented by software executed by a processor, or implemented by a combination of software and dedicated hardware.

Node status messages 1150 cause the population of a node status table 1152. The node status table 1152 includes status rows 1804A-1804N with information about the capabilities and status of each and every node on the internal capture fabric. For each node, each status row includes a compute identifier 1811 (e.g., MAC address, IP address, etc.), a disk identifier 1812 (e.g., node identifier), a maximum advertised rate 1813, and whether the node is currently available 1814 for capturing data. In one embodiment, each compute identifier 1811 correlates to a separate server using one or more separate hard disks. Note, the node status table 1152 is also referred to as the node status and discovery 1152 (e.g., a subsystem) in FIG. 11.

As shown in FIG. 18B, the node status and discovery subsystem 1152 also maintains an active node table 1820 (e.g., pointers for nodes) and an idle node list table 1822 (e.g., a first-in-first-out (FIFO) buffer for pointers). The process of FIG. 18B helps determine what rebalancing can be done. If a status message 1150 indicates that an active node is becoming unavailable, an attempt is made to replace the node with an idle node from the idle node list 1822. The node status and discovery subsystem 1152 pops off idle nodes from the idle node list 1822 (e.g., pops off pointers) until an available idle node is found. It is preferred to use an idle node list to keep all flows from the departing node together.

After an available idle node is found, the status message 1150 causes the availability 1814 of the failing or departing node in the node status table 1820 to be updated to unavailable. Depending on the reason for unavailability and the failure history of the departing node, the departing node may be subsequently added back into the idle list 1822 for later re-use.
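
A hedged sketch of that replacement step follows; the data structures (a status mapping and a FIFO idle list) and the function name are illustrative assumptions standing in for the node status table, the active node table 1820, and the idle node list 1822.

```python
from collections import deque

def replace_departing_node(departing: int, available: dict[int, bool],
                           idle_list: deque) -> int | None:
    """Mark the departing node unavailable, then pop idle nodes until an available one is found."""
    available[departing] = False              # availability 1814 updated to unavailable
    while idle_list:
        candidate = idle_list.popleft()       # pop off idle nodes (pointers) in FIFO order
        if available.get(candidate, False):
            return candidate                  # replacement idle node found
    return None                               # no replacement: cold rebalancing must reassign bins
```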

Count-Min Sketch

Referring now to FIG. 19, the process of the basic count-min sketch algorithm is shown and now described. The count-min sketch algorithm is used by the intelligent load balancer to perform dynamic load balancing of hot and cold flows. It generates an estimate of the current bytes in the queue for the flow. That estimate is compared against the hot threshold 1116 to distinguish hot flows from cold flows. The count-min sketch attempts to get a bandwidth estimate of each flow without maintaining a separate counter for each flow.

The count-min sketch 1900 may be implemented with high-speed SRAM. The record summary queue 1902 may be implemented with slower DRAM.

A sketch 1900 includes a compact table of a plurality of counters. The sketch 1900 has depth d (Y-axis) with rows of width w (X-axis), such that there are (d·w) counters in the sketch. The size of the sketch may be related to the number of bins of flows that are to be processed by the nodes. In the sketch, each row j has a different associated pairwise-independent hash function which maps input flow hashes h(G) (keys) onto the set of w counters Cj for that given row. Rows may be processed in parallel.

A record with flow hash G1 is inserted into the top of the record summary queue 1902. The record summary queue 1902 can store N records that are shifted each time a new record is generated. With G1 being inserted into the top of the record summary queue 1902, the oldest record associated with the oldest packet(s) is pushed out of the bottom of the record summary queue 1902.

On insert of the record with the flow hash G1 (key), the record size S1, representing a count, is added to a single counter for each row at positions h1(G1), h2(G1), . . . , hd(G1). Then, an estimate Estimate(G1) of the current total count for the flow hash G1 is returned as the minimum of the counters at positions h1(G1), h2(G1), . . . , hd(G1) within the sketch 1900.

The estimate Estimate(G1) of the current total count for the most recentflow hash G1 is used to form an estimate of the byte threshold and theapproximate bandwidth that is used to form the hot threshold 1116 anddistinguish hot flows from cold flows.

After the oldest record is pushed out of the record summary queue 1902,presuming it has been processed and stored in the storage deviceassociated with the node, the record can be removed from the sketch1900. For example, removal of the record with flowhash G_(N) and recordsize S_(N) is performed by subtracting the record size S_(N) from asingle counter in each row at positions h1(GN), h2(GN), . . . , hd(GN).
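
For illustration only, a minimal Python sketch of the count-min insert, estimate, and remove operations described above, combined with a bounded record summary queue. The sketch depth, width, hot threshold value, and salted hash functions are stand-ins and not taken from the specification.

```python
import random
from collections import deque

class CountMinSketch:
    """Toy count-min sketch with d rows of w counters; salted Python hashes
    stand in for the per-row pairwise-independent hash functions."""

    def __init__(self, depth=4, width=1024):
        self.depth, self.width = depth, width
        self.counters = [[0] * width for _ in range(depth)]
        self.salts = [random.getrandbits(32) for _ in range(depth)]

    def _pos(self, row, flow_hash):
        return hash((self.salts[row], flow_hash)) % self.width

    def add(self, flow_hash, size):
        """Add record size S to one counter per row at positions h_j(G)."""
        for j in range(self.depth):
            self.counters[j][self._pos(j, flow_hash)] += size

    def remove(self, flow_hash, size):
        """Subtract the size once the record leaves the summary queue."""
        for j in range(self.depth):
            self.counters[j][self._pos(j, flow_hash)] -= size

    def estimate(self, flow_hash):
        """Estimate(G): minimum of the d counters touched by flow hash G."""
        return min(self.counters[j][self._pos(j, flow_hash)]
                   for j in range(self.depth))

HOT_THRESHOLD = 1_000_000        # illustrative byte threshold (cf. 1116)
sketch = CountMinSketch()
summary_queue = deque(maxlen=64) # record summary queue of N recent records

def on_record(flow_hash, size):
    """Insert a record, age out the oldest one, and classify hot vs. cold."""
    if len(summary_queue) == summary_queue.maxlen:
        old_hash, old_size = summary_queue[0]  # oldest record falls out
        sketch.remove(old_hash, old_size)      # real system removes it after storage
    summary_queue.append((flow_hash, size))
    sketch.add(flow_hash, size)
    return sketch.estimate(flow_hash) >= HOT_THRESHOLD  # True means "hot"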

Network Query/Search

It is desirable to quickly search through the stored data packet communications for network attacks with little impact on continuing operations. The processor of each intelligent hard drive can quickly search through the data stored on the one or more hard drives to which it is coupled. A search request can be multicast out to the multicast group of intelligent hard drives so that each can search through the data stored therein.

It is desirable to evenly spread out the load of data to be searched over the plurality of intelligent hard drives in the array so that minimal impact is made by the search on the continued storage of network flows of data packets. Accordingly, load balancing can be desirable in the storage of data into the array of the plurality of intelligent hard drives. The specific flow-coherent properties of the intelligent load balancer permit highly efficient distributed processing. One such distributed process is a query process.

FIG. 20 shows a query process on the nodes in the system. A query agent 2000 is coupled to the intelligent network recording system 800 via the network.

The query agent 2000 receives requests 2001 for packets from an outside or remote client. The requests 2001 include one or more combinations of time range, flow hash key, and packet filter. Using a multicast IP method similar to the intelligent load balancing (ILB) resource discovery method, the query agent 2000 multicasts these requests to all nodes 950. When a node receives the multicast query 2004 from the query agent 2000, it begins the process of searching through the packets stored in its packet storage device based on timestamp and flow hash indexes, looking for the packets relevant to the request. The relevant packets that are found are passed through a query packet filter, if a query packet filter was provided with the query. With the distributed processing, the filtering is advantageously performed in a distributed but parallel manner at the nodes 950.
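
For illustration only, a Python sketch of how a single node could answer such a multicast query using a timestamp-sorted index. The record layout (dicts with "timestamp", "flow_hash", and "packet" keys) and the function name are assumptions for the example, not details from the specification.

```python
from bisect import bisect_left, bisect_right

def handle_multicast_query(query, records_by_time):
    """Yield stored records matching the query's time range, optional flow
    hash key, and optional packet filter, in timestamp order."""
    start, end = query["time_range"]
    wanted_hash = query.get("flow_hash")        # None means "any flow"
    packet_filter = query.get("packet_filter")  # callable or None

    # records_by_time is assumed sorted by timestamp, standing in for the
    # on-disk timestamp index used to accelerate retrieval.
    timestamps = [r["timestamp"] for r in records_by_time]
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    for record in records_by_time[lo:hi]:
        if wanted_hash is not None and record["flow_hash"] != wanted_hash:
            continue
        if packet_filter is not None and not packet_filter(record["packet"]):
            continue
        yield record   # later encapsulated and returned in timestamp order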

Once packets are found, they are encapsulated by the node and sent back over the network 2082 to the query agent 2000 in timestamp order. The query agent 2000 maintains a connection with the nodes 950 that respond during the query process. Note that this is a simple multi-way merge as the nodes 950 respond in timestamp order. As the packets are returned, the query agent 2000 sorts them into a globally consistent timestamp order. The query agent 2000 then returns the packets to the requester as the response 2002. The packets may be returned to the requester in either a streaming manner or as a complete response capture file.

Once all nodes 950 have completed the query and the query agent 2000 has responded to the requester with the final timestamp-ordered packet in the file, all connections to the nodes 950 can close.

Note that a key advantage of the intelligent load balancing 900 for packet query is that the “hot” flows are distributed widely, resulting in a high return bandwidth not limited by a single node. As “hot” flows are the large flows, this results in significant bandwidth advantages. Metadata queries, such as to query a full flow tracking database or determine expected response size, may also be supported.

During the packet storage process, the node has stored packets using indexes based on timestamp and flow hash key. In an alternate embodiment, sufficient analytic processing may be performed to determine if an arriving packet is likely to be from a broken flow. The presence of a “hot” indication attached to the record metadata by the intelligent load balancer confidently indicates the flow is broken. However, it may not be the only possible cause for a broken flow. Additional detection heuristics for a broken flow may include, but are not limited to, protocol start and end signals and the use of full flow descriptors to check for flow hash collisions. For non-session-oriented flows, such as UDP, a flow timeout may be used.

Information concerning broken flows is stored in a local database 2011 or a distributed database 2012 at each node (e.g., intelligent hard drive). Note that the distributed database 2012 may partially reside on each node (e.g., intelligent hard drive) or may be a separate network database that resides on a separate storage device. In the case of a distributed database 2012, an existing entry for the flow from another node may be used to assist in determining if the flow is potentially broken. Without a distributed database, a communication mechanism may be needed in order to check if a flow has been seen elsewhere, especially for non-session-oriented protocols. In particular, the processing attempts to determine if the packet is the first packet of the flow. A relatively low number of false positive broken flows is not a serious issue as it simply increases east-west traffic. The broken flow tracking database (or an additional database) may also be used as a full flow tracking database to allow flow metadata analytics, visualization, and to match flow metadata queries to flow hash keys.

The characteristics of the intelligent load balancer 900 make the system capable of distributed flow-coherent analytics. For example, deep packet inspection (DPI) for the purpose of application detection can be performed during packet storage. For “cold” flows this is straightforward, as all packets associated with a particular flow will arrive at a single node in time order. The first N packets of all flows go to the same destination with very high probability for DPI. For “hot” flows, a significant advantage of this intelligent load balancing implementation is that it tends to ensure a certain number of packets are “cold” balanced, or sent to a single node, before “hot” balancing takes place. This certain number of packets (system specified, usually around 20) is sufficient for application detection without accessing the remainder of the flow.
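
For illustration only, a short Python sketch of running application-detection DPI on just the first N packets of each flow, which the load balancer keeps together at one node before any "hot" balancing. The dpi_classify callable and all names here are hypothetical placeholders.

```python
from collections import defaultdict

DPI_PACKET_BUDGET = 20          # system specified, usually around 20
seen_packets = defaultdict(int) # flow hash -> packets inspected so far
detected_app = {}               # flow hash -> application label

def maybe_run_dpi(flow_hash, packet, dpi_classify):
    """Classify the flow's application using at most the first N packets."""
    if flow_hash in detected_app:
        return detected_app[flow_hash]      # already classified
    if seen_packets[flow_hash] < DPI_PACKET_BUDGET:
        seen_packets[flow_hash] += 1
        label = dpi_classify(packet)        # e.g., port/payload heuristics
        if label is not None:
            detected_app[flow_hash] = label
            return label
    return None   # past the packet budget or not yet classified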

The system is also capable of more elaborate distributed flow-coherent analytics. In particular, various forms of intrusion detection system (IDS) analytics can be performed in a distributed manner (e.g., SNORT or SURICATA). This process may be referred to as “back-testing”, as it is normal to perform IDS only in real time. However, it is extremely useful, when a new IDS rule set is released, to run it in a “back-testing” mode over the last week's packet data. This allows an operator to determine if the new threat in the new rule set was actually active during the last week. An example of this would be running the SSL Heartbleed IDS attack signature over the last week's data on the day the Heartbleed attack was identified. The difficulty in performing IDS back-testing is that complete flows need to be available. Unlike application detection, which may require only the first twenty or thirty packets of a flow, IDS typically uses the entire flow of packets.

In one example, an IDS query request begins identically to the query agent 2000 process described above. However, the filter contains a list of IDS rule sets to run (in general, large IDS rule sets are pre-loaded on the system and just the rule identifiers are queried). The node begins its query process in an identical way by searching for packets in its packet storage based on timestamp and flow hash indexes. The output packets are not returned to the query agent 2000 directly; instead, they are sent to a local IDS process which performs the actual distributed “back-testing” and only returns any alerts found to the query agent 2000. In some cases, the node will detect, using the broken flow database, that a particular packet is likely to be part of a broken flow. It will not send this packet directly to the IDS process. If the broken flow database marks this packet as the first packet in a broken flow, the node will begin the flow re-assembly process. If this is not the first packet, the packet will be discarded.

For broken or “hot” flows, a flow re-assembly process is necessary. A node finding the start of a broken flow begins a distributed query process, similar to that of the query agent 2000 described with reference to FIG. 20. This node issues a multicast request for packets which may be part of the broken flow. Once the node has received all parts of the broken flow, the node reassembles the flow and sends it to the IDS process.

Query Return

FIG. 22 shows details of a query return process. Query agent 2000 receives request 2001, prepares a query message using query sender 2315, and multicasts the query 2004 to all nodes. The diagram shows two intelligent hard drives 250,250′, each with a plurality of nodes 2301A,2301B,2301N. Each of the plurality of node query return processes 2301A,2301B,2301N manages one disk 212N. Each intelligent hard drive 250,250′ receives multicast query 2004 at a query responder 2300,2300′. The query messages 2312,2004,2305 may be communicated using network interface cards 234,2314 via high-speed switch 202 and high-speed networking cables 2332,2332′,2082, or alternatively via a separate connection, such as a management network interface card, an Ethernet cable 257, and gigabit switch 242. Query responder 2300 receives query 2004 and instantiates or configures per-node query return processes 2301A,2301B,2301N including Filter 2303N. Filter process 2303N filters the records from disk 212N to the set of records that match query 2004. Filter 2303N uses the flow hash and timestamp index for records stored on disk 212N to accelerate retrieval. Some node-specific aspects of database 2012, such as the flow hash and timestamp index, may reside on disk 212N and may be stored in a different format. After passing through filter 2303N, matching records are encapsulated in timestamp order by encapsulation 2304N for transmission over network interface card 234. The encapsulation includes metadata describing the source node identifier 250,212N. Query responder 2300 may update query agent 2000 on the progress of the query response using query status messages 2305 and rate-limit query processes 2301A,2301B,2301N upon reception of flow control messages 2312.

Query agent 2000 receives encapsulated query responses at network interface card 2314 via one or more high-speed networking cables 2082 coupled to high-speed switch 202. Query responses from each of the per-node query return processes 2301A,2301B,2301N are transferred into query response buffers 2316A,2316B,2316N. Query response merge process 2311 performs a time-ordered merge by inspecting the timestamp of first un-merged records 2317A,2317B,2317N and merging them into time-ordered response stream 2317 for response 2002 to the query requester.

Query status messages 2305,2305′ are used by query merge process 2311 to determine when the query response from node 2301N is complete and to determine when waiting for the next query response record 2317N is unnecessary. Flow control messages 2312 are sent by query response merge process 2311 to avoid overflow of buffers 2316A,2316B,2316N. Query response buffers 2316A,2316B,2316N and merge process 2311 may be entirely in hardware, in a combination of hardware and software, or entirely in software. In some embodiments, query agent 2000 may optionally pass query response streams 2316A,2316B,2316N directly as response 2002 to one or more analysis applications that generate analysis results.
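
For illustration only, a Python sketch of the query agent's time-ordered merge over per-node response buffers. Each per-node stream is assumed to already be in timestamp order, as the nodes respond in timestamp order; the record shape (dicts with a "timestamp" key) is an assumption for the example.

```python
import heapq

def merge_query_responses(response_buffers):
    """Merge several per-node record streams, each already sorted by
    timestamp, into one globally time-ordered response stream."""
    heap = []
    for node_id, buf in enumerate(response_buffers):
        it = iter(buf)
        first = next(it, None)
        if first is not None:
            # (timestamp, node_id) keeps heap comparisons unambiguous.
            heapq.heappush(heap, (first["timestamp"], node_id, first, it))
    while heap:
        ts, node_id, record, it = heapq.heappop(heap)
        yield record                      # next record in global time order
        nxt = next(it, None)
        if nxt is not None:
            heapq.heappush(heap, (nxt["timestamp"], node_id, nxt, it))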

In one embodiment, encapsulation 2304N,2304N′ encapsulates query results into Ethernet frames substantially similar to the encapsulation used by intelligent load balancer 900. In another embodiment, encapsulation 2304N,2304N′ transfers query results to query buffer 2316N as a TCP stream where the source IP address and port constitute the node identifier. In this embodiment, flow control 2312 uses TCP flow control mechanisms. Query status messages 2305 may be delivered in-band. One or more of the high-speed networking cables 2332,2332′,2082 may be the same cables 232,282 as used for transmission from intelligent load balancer 900 or may be a different set of cables.

Node Capture and Broken Flow Detection

FIG. 23A shows a record capture flow within an intelligent hard drive 250.

Encapsulated records from the intelligent load balancer 900 are received at NIC 234. Such encapsulated records are directed to one of the plurality of logical nodes 260A,260B,260N within the intelligent hard drive, steered by disk ID 2405 to a particular logical node 260N. Each logical node 260N manages one disk 212N. Encapsulated records are then de-encapsulated by process 2404N, optionally passed through deep packet inspection process 2403N, and then passed to flow tracker and broken flow detector 2400N before finally being stored at disk 212N. Disk ID steering 2405 and de-encapsulation 2404N may be performed in hardware, in software, or a combination of both.
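
For illustration only, a Python sketch of de-encapsulation and disk-ID steering on the receive path. The record framing used here ([1-byte disk ID][2-byte length][payload]) is invented for the example and is not the actual encapsulation format of the embodiments.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class LogicalNode:
    disk_id: int
    stored: List[bytes] = field(default_factory=list)

    def store(self, record: bytes):
        # In the real system the record would pass through optional DPI
        # (2403N) and the flow tracker / broken flow detector (2400N)
        # before being written to disk 212N.
        self.stored.append(record)

def de_encapsulate(frame: bytes) -> List[Tuple[int, bytes]]:
    """Toy de-encapsulation of a frame into (disk_id, payload) records."""
    records, i = [], 0
    while i + 3 <= len(frame):
        disk_id = frame[i]
        length = int.from_bytes(frame[i + 1:i + 3], "big")
        records.append((disk_id, frame[i + 3:i + 3 + length]))
        i += 3 + length
    return records

def receive_frame(frame: bytes, nodes: Dict[int, LogicalNode]):
    """Steer each de-encapsulated record to the logical node owning its disk."""
    for disk_id, payload in de_encapsulate(frame):
        nodes[disk_id].store(payload)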

Flow tracker and broken flow detector 2400N of each of the plurality of logical nodes 260A,260B,260N is coupled to database storage 2012, which stores flow tracking and/or local broken flow state. Database storage 2012 may be a distributed network database. Flow detector 2400N is also coupled to broken flow messaging system 2401. In an alternative embodiment, broken flow messaging system 2401 is a distributed database which may be combined with database storage 2012. Broken flow messaging system 2401 may also be coupled in direct communication with database storage 2012.

FIG. 23B shows one embodiment of messages 2402 transmitted by broken flow messaging system 2401, by way of an example broken flow 2450 split across three nodes 260A, 260B, 260C. Node 1 receives the first segment of the flow, 2450A. Upon receiving start of flow marker 2410 (e.g., TCP SYN), it multicasts message 2411 indicating it has seen the start of flow 2450. Node 1 multicasts an additional broken flow indication message 2413 when end of flow flag 2431 (e.g., TCP FIN) has not been received within timeout 2414 after the last packet from flow 2450 is received, 2412. Node 2 receives middle flow segment 2450B and multicasts broken flow message 2423 when start of flow marker 2410 has not been received within timeout 2424 after the last packet from flow 2450 is received. Node 3 receives final flow segment 2450C and multicasts message 2432 upon reception of end of flow marker 2431, indicating it has seen the end of flow 2450, additionally multicasting broken flow message 2443 after timeout 2434, having not received start of flow marker 2410.

Start of flow markers 2410 include, but are not limited to, the TCP SYN flag. End of flow markers 2431 include, but are not limited to, the TCP FIN flag.

Node 1, node 2, and node 3 can determine flow 2450 is broken due to any of broken flow messages 2411, 2413, 2423, 2432, 2443 where that message is received from another node. Node 2 and node 3 can additionally detect flow 2450 is broken due to not receiving start of flow marker 2410 by the end of timeout 2424 and 2434, respectively. Similarly, node 1 can additionally detect flow 2450 is broken due to not receiving end of flow marker 2431 by the end of timeout 2414. It should be noted that the described broken flow detection process is robust to flow packet mis-ordering while allowing nodes to detect almost all broken flows rapidly, before timeouts 2414, 2424, 2434, which may need to be on the order of 30 seconds or more. Broken flow detector 2400 may additionally use hot flow marker 1090 to detect broken flows. The timeout value represents a tradeoff between responsiveness and the realistic maximum amount of flow packet mis-ordering.
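
For illustration only, a simplified Python sketch of per-node broken flow detection following the scheme above: a flow is treated as broken when another node reports seeing part of it, or when a start/end marker never arrives within the timeout after the last packet. The message dictionaries and callback signatures are assumptions, not the actual message format.

```python
import time

FLOW_TIMEOUT = 30.0   # seconds; timeouts 2414/2424/2434 may be of this order

class FlowState:
    def __init__(self):
        self.saw_start = False                 # TCP SYN seen locally
        self.saw_end = False                   # TCP FIN seen locally
        self.last_packet = time.monotonic()
        self.broken = False

flows = {}

def on_packet(flow_hash, syn, fin, multicast):
    """Track markers for a locally received packet and announce them."""
    st = flows.setdefault(flow_hash, FlowState())
    st.last_packet = time.monotonic()
    if syn:
        st.saw_start = True
        multicast({"type": "start_seen", "flow": flow_hash})  # cf. message 2411
    if fin:
        st.saw_end = True
        multicast({"type": "end_seen", "flow": flow_hash})    # cf. message 2432

def on_broken_flow_message(flow_hash, sender_id, self_id):
    """A message about this flow from another node means it is split."""
    if sender_id != self_id and flow_hash in flows:
        flows[flow_hash].broken = True

def check_timeouts(multicast):
    """Declare flows broken when a SYN or FIN never arrived in time."""
    now = time.monotonic()
    for flow_hash, st in flows.items():
        if not st.broken and now - st.last_packet > FLOW_TIMEOUT:
            if not (st.saw_start and st.saw_end):
                st.broken = True
                multicast({"type": "broken", "flow": flow_hash})  # cf. 2413/2423/2443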

Broken Flow Reassembly

FIG. 24 shows a broken flow reassembly process for back-testing. Node back-testing subsystems 2501A,2501B,2501N read records from corresponding disk 212N. For the first packet read, broken flow reassembly process 2500 consults database storage 2012 to determine if the flow 2507,2306,2305 is broken (see FIG. 23). Broken flow reassembly process 2500N is coupled in communication with local query agent 2501, substantially similar to query agent 2000 as described in FIG. 23. If the flow is broken, a query for that flow is multicasted. Query responses 2502 are returned via network interface card 234 and merged by the local query agent, as in process 2311, into stream 2306. Broken flow reassembly 2500N then merges 2306 and remaining record stream 2305 into completed record stream 2507. When there are multiple broken flows in 2305, there may be additional merging buffers substantially similar to 2306.
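
For illustration only, a compact Python sketch of reassembling a broken flow for back-testing: local records for the flow are merged in timestamp order with records fetched from other nodes via a multicast query, and the completed flow is handed to the IDS process. The query_other_nodes and ids_process callables and the record shapes are placeholders.

```python
import heapq

def reassemble_broken_flow(local_records, remote_records):
    """Both inputs are already in timestamp order; the result is the
    complete record stream for the flow (cf. streams 2305, 2306, 2507)."""
    return list(heapq.merge(local_records, remote_records,
                            key=lambda r: r["timestamp"]))

def back_test_flow(flow_hash, local_records, query_other_nodes, ids_process):
    """Fetch the missing parts of a broken flow, reassemble, and run IDS."""
    remote = query_other_nodes(flow_hash)      # multicast query, cf. FIG. 20
    complete = reassemble_broken_flow(local_records, remote)
    return ids_process(complete)               # returns any alerts found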

Optionally, matching records may be returned to the remote query agent via encapsulation 2504 and network interface card 234, or simply summary information of matches via query status messages 2307.

In one embodiment, local query agent 2501 may use information from broken flow messages 2402, or stored in database 2012, to determine a priori which nodes contain records from the broken flow. Hot flows may be limited or excluded from flow reassembly to reduce east-west bandwidth, or processed by a highly capable query agent. Broken flows residing on local node disks 212A,212B are also retrieved, although the query response embodiment may differ for efficiency reasons.

Advantages

Intelligent load balancing is provided for distributed deep packet inspection (DPI) and packet storage where individual nodes have limited storage capability. DPI generally identifies flow information by the first N packets, so the flow of the first N packets is maintained together wherever possible. Intelligent load balancing is intended to be used in a system where storage bandwidth is over-provisioned to allow efficient retrieval and analysis. In addition, distributed intrusion detection system (IDS) type analysis is permitted through efficient distributed flow re-assembly with intelligent load balancing.

Separation of “hot” and “cold” traffic allows taking advantage of the different characteristics of each. Cold flows are generally well spread with a good hash function. The individual cold flows typically do not exceed the capacity of a single node. Rebalancing of cold flows only occurs when absolutely necessary and with minimal disruption through utilization of dynamic hash load balancing.

Hot flows make up the majority of the packet traffic and, beyond initial identification by deep packet inspection, are usually less important. A hot flow may be too hot for an individual node to handle. Hot flows are typically broken up and distributed amongst a plurality of nodes. Even a portion of an individual hot flow can represent a significant share of the overall traffic destined for a node by cold balancing. Hot traffic can otherwise be load balanced with a weighted random distribution to produce an even total load over a plurality of nodes.
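
For illustration only, a Python sketch of one possible weighted random spread of "hot" traffic, where each node is weighted by the headroom left after its cold-balanced load so the total load evens out. The weighting rule and all names are assumptions, not the balancing policy specified above.

```python
import random

def pick_hot_destination(nodes):
    """nodes: list of (node_id, advertised_rate, current_cold_load) tuples.
    Returns the node_id chosen for the next hot record."""
    weights = [max(rate - cold_load, 0.0) for _, rate, cold_load in nodes]
    if sum(weights) == 0:
        weights = [1.0] * len(nodes)   # all saturated: fall back to uniform
    return random.choices([n for n, _, _ in nodes], weights=weights, k=1)[0]

# Example: a node already carrying heavy cold traffic receives fewer hot records.
print(pick_hot_destination([("n1", 10e9, 2e9), ("n2", 10e9, 8e9)]))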

Hot traffic is treated as cold until a threshold is reached, which ensures the first part of the flow is sent to the same destination (unless broken by cold rebalancing). It is desirable to detect sketch false positives and hash collisions where they would lead to breaking cold flows (i.e., incorrectly treating a cold flow as hot). The system can coarsely bin cold flows.

Intelligent load balancing can be used for distributed deep storage, distributed online/offline analysis, and efficient retrieval by maintaining flow coherency. It provides minimal flow reconstruction so that east-west traffic is reduced, as well as reducing flow cross-section loss on disk or storage failure. In one embodiment, flow reconstruction includes generating or receiving a reconstruction (e.g., replica) of one or more conversations that have been transmitted to one or more nodes. A reconstruction is not necessarily a perfectly accurate reconstruction of the one or more conversations. In one embodiment, flow reconstruction is performed by nodes that are querying each other. A reconstruction may be performed during, and separate from, a back-testing query.

Encapsulation of multiple records reduces the packet rate at the node, meaning dimensioning is primarily concerned with bandwidth, and the receive-interrupt overhead of traditional network interface cards (NICs) is reduced. This reduces the compute requirements, allowing the use of low-cost, low-power compute units without special-purpose capture interfaces, thereby providing more headroom for packet analysis.
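
For illustration only, a Python sketch of batching multiple records into one frame so that the per-node packet (and interrupt) rate depends on bandwidth rather than record count. The header layout matches the toy de-encapsulation sketch shown earlier and is invented for the example; the frame size is an assumption.

```python
MAX_FRAME_PAYLOAD = 8900   # e.g., jumbo-frame sized batches (assumption)

def encapsulate_records(records):
    """records: iterable of (disk_id, payload) pairs. Yields frames holding
    as many whole records as fit, each prefixed by disk_id and length."""
    frame = bytearray()
    for disk_id, payload in records:
        entry = bytes([disk_id]) + len(payload).to_bytes(2, "big") + payload
        if frame and len(frame) + len(entry) > MAX_FRAME_PAYLOAD:
            yield bytes(frame)          # emit the full frame, start a new one
            frame = bytearray()
        frame += entry
    if frame:
        yield bytes(frame)              # flush the final partial frame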

The system applies very different load balancing policies to hot and cold traffic to mitigate cold flow breakage while maintaining an even load suitable for network analytics. The system uses weighted real-time hot flow balancing to maximize overall evenness without increasing the number of broken flows (using essentially already-broken flows).

The system maintains coherency of cold flows (e.g., coarse cold balancing adjustment only when necessary, or when a node is still overloaded after the above mechanisms fail). The system also takes advantage of the delay in hot flow detection to enable deep packet inspection/application detection even of hot flows.

There is use for the system in distributed network capture and storage that specifically enables distributed online/offline analysis and efficient retrieval through maintaining flow coherency, and allows minimal flow reconstruction that reduces east-west traffic. There is also use for applications that need minimal flow reconstruction (e.g., application detection in DPI). The system also has reduced flow cross-section loss on disk failure.

The system has a simple load balancing design for network analytics that allows distribution among a large number of very low bandwidth, low cost nodes, while capturing at a high rate, including encapsulation to reduce the packet rate at nodes. The system has flow-coherent storage which enables IDS back-testing in a distributed manner at real-time speeds.

The system uses a method that is hardware/software agnostic in such a way that high performance can be obtained by placing the data path portions in hardware. Alternatively, cost can be reduced by placing the data path portions fully in software, while maintaining high efficiency.

The intelligent load balancer and network recorder solution is also generally applicable to network analytics. As network speeds increase far faster than storage/processing speeds, intelligent load balancing should become more in demand. As the method is hardware/software agnostic, a fully software version operating within a virtual environment (e.g., cloud infrastructure-as-a-service) can be a significant technology enabler.

The intelligent load balancing algorithm can significantly improve the performance of network products. Also, a hardware-based platform for IDS “back-testing” has significant potential among large organizations.

CONCLUSION

Various specific materials, designs, dimensions, etc. are provided and are considered highly beneficial embodiments of the present disclosure. However, such specifics are also merely illustrative of broader aspects of the present disclosure and should not be considered to necessarily limit such broader aspects unless expressly specified to be required.

When implemented in software, elements of the embodiments are essentially the code segments or instructions to perform the functional tasks described herein. The code segments or instructions are executable by a processor, such as processor cores in the microcomputer 750 illustrated in FIG. 7B, and can be stored in a storage device or a processor readable storage medium, such as memory 761,762,763 illustrated in FIG. 7B, awaiting execution. The processor readable storage medium may include any medium that can store information. Examples of the processor readable storage medium include an electronic circuit, a semiconductor memory device, a read only memory (ROM), a flash memory, an erasable programmable read only memory (EPROM), a floppy diskette, a CD-ROM, an optical disk, and a hard disk. The code segments or instructions may be downloaded via computer networks such as the Internet, an intranet, etc. into the processor readable storage medium.

Various combinations and sub-combinations, and modifications as may be made, of the presently disclosed components and embodiments and aspects are contemplated, whether or not specifically disclosed, to the extent and as would be apparent to one of ordinary skill based upon review of this disclosure and in order to suit a particular intended purpose or application. For example, the high speed intelligent network recorder (or controller unit) may include one or more elements of the intelligent load balancer to further integrate them together as one network device.

While this specification includes many specifics, these should not be construed as limitations on the scope of the disclosure or of what may be claimed, but rather as descriptions of features specific to particular implementations of the disclosure. Certain features that are described in this specification in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations, separately or in sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination may in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variations of a sub-combination. Accordingly, the embodiments are to be limited only by the patented claims that follow below.

What is claimed is:
 1. A method for a query return process for a network recorder on a network, the method comprising: receiving a query from a query requester; multicasting the query to a plurality of nodes, wherein a plurality of incoming packets to the network have undergone cold balancing among the plurality of nodes over a relevant time window, wherein the relevant time window represents a usable storage capacity of the plurality of nodes by using a plurality of units of time, wherein the plurality of units of time decreases according to an amount of failed storage capacity; receiving query responses from the plurality of nodes; transferring the query responses to a plurality of query response buffers; performing a time-ordered merge by inspecting timestamps of un-merged query responses from the query response buffers; and merging the un-merged query responses into a time-order response stream for response to the query requester.
 2. The method of claim 1, further comprising: receiving query status messages from the plurality of nodes; and based on the query status messages, determining when a query response from a node is complete.
 3. The method of claim 2, further comprising: further based on the query status messages, determining when waiting for a next query response record from a node is unnecessary.
 4. The method of claim 1, further comprising: sending flow control messages to the plurality of nodes to avoid overflowing the query response buffers.
 5. The method of claim 1, wherein receiving the query responses comprises: receiving the query responses from query return processes, wherein each query return process manages one node.
 6. The method of claim 1, wherein receiving the query responses comprises: receiving the query responses from a query responder that configures a query return process for each of the plurality of nodes.
 7. The method of claim 1, further comprising performing back-testing including: replaying query responses received from each node through an analysis application to generate an analysis result; and receiving the analysis result at a query agent that performs the merging.
 8. The method of claim 1, wherein: the plurality of nodes includes one or more nodes querying each other; and the one or more nodes querying each other perform reconstruction of network flows.
 9. The method of claim 8, wherein: the flow reconstruction is performed during, and separate from, a back-testing query.
 10. A computer-readable product for a query return process for a network recorder on a network, the computer-readable product including a non-transitory computer-readable storage medium storing instructions that when executed perform the functions comprising: receiving a query from a query requester; multicasting the query to a plurality of nodes, wherein a plurality of incoming packets to the network have undergone cold balancing among the plurality of nodes over a relevant time window, wherein the relevant time window represents a usable storage capacity of the plurality of nodes by using a plurality of units of time, wherein the plurality of units of time decreases according to an amount of failed storage capacity; receiving query responses from the plurality of nodes; transferring the query responses to a plurality of query response buffers; performing a time-ordered merge by inspecting timestamps of un-merged query responses from the query response buffers; and merging the un-merged query responses into a time-order response stream for response to the query requester.
 11. The computer-readable medium of claim 10, further comprising: receiving query status messages from the plurality of nodes; and based on the query status messages, determining when a query response from a node is complete.
 12. The computer-readable medium of claim 11, further comprising: further based on the query status messages, determining when waiting for a next query response record from a node is unnecessary.
 13. The computer-readable medium of claim 10, further comprising: sending flow control messages to the plurality of nodes to avoid overflowing the query response buffers.
 14. The computer-readable medium of claim 10, wherein receiving the query responses comprises: receiving the query responses from query return processes, wherein each query return process manages one node.
 15. The computer-readable medium of claim 10, wherein receiving the query responses comprises: receiving the query responses from a query responder that configures a query return process for each of the plurality of nodes.
 16. The computer-readable medium of claim 10, further comprising performing back-testing including: replaying query responses received from each node through an analysis application to generate an analysis result; and receiving the analysis result at a query agent that performs the merging.
 17. The computer-readable medium of claim 10, wherein: the plurality of nodes includes one or more nodes querying each other; and the one or more nodes querying each other perform reconstruction of network flows.
 18. The computer-readable medium of claim 17, wherein: the flow reconstruction is performed during, and separate from, a back-testing query.